Hub 2.3.3 is out! Version control upgrade, helper functions, GSOC 2022, and exciting community contributions. Here’s what’s new.
New Hub features
Now you can delete uncommitted changes using ds.reset(). Also, with Hub 2.3.3 you can merge branches and commits using ds.merge(). Copying datasets from one location to another is now possible using hub.copy(), or hub.deepcopy() if you also want to carry over the version control history. Metadata from the file headers of samples appended using hub.read(fn) is now automatically stored in ds.tensor_name.sample_info.
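To make these features concrete, here is a minimal sketch of how they might be combined. It assumes Hub 2.3.3 or later is installed; the local path, the tensor name "labels", the branch name "experiment", and the function name version_control_demo are all made up for illustration, and exact behaviour may differ between releases.

```python
def version_control_demo(path: str = "./hub_demo"):
    """Walk through reset, merge and deepcopy on a throwaway local dataset."""
    import hub  # imported lazily; assumes Hub 2.3.3+ is installed

    # Create an empty local dataset with one tensor (names are illustrative).
    ds = hub.empty(path, overwrite=True)
    ds.create_tensor("labels", htype="class_label")
    ds.labels.append(0)
    ds.commit("Add first label")

    # Make a change, then discard it: reset() drops uncommitted changes.
    ds.labels.append(1)
    ds.reset()

    # Commit work on a separate branch, then merge it back into main.
    ds.checkout("experiment", create=True)
    ds.labels.append(2)
    ds.commit("Add label on experiment branch")
    ds.checkout("main")
    ds.merge("experiment")

    # deepcopy() copies the dataset together with its version control history.
    return hub.deepcopy(path, path + "_copy", overwrite=True)
```

Calling version_control_demo() writes two dataset folders under the current directory, so it is best run in a scratch location.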
Community shoutouts
Abid Ali Awan has written a great guide on Hub and the Activeloop Platform for KDnuggets!
Alex Wang has uploaded and documented the KMNIST dataset on our Machine Learning Datasets Catalogue.
Manas Gupta has documented the Google Objectron dataset on our Machine Learning Datasets Catalogue.
Paul created an example for using Hub, TensorBoard & Docker to train a model in PyTorch.
Jinyi Chen is currently finalizing the Chinese version of the readme! Let us know if you’d like to translate it into other languages.
Bikram Maharjan is working on support for additional image formats in hub.auto. Thanks also to Suhaas Neel for the multiple PRs he’s working on!
GSOC 2022
GSOC proposals opened yesterday. Make sure you contribute/finalize your PRs by April 19 and apply!
React 2023
---------
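import requests
import tqdm
from typing import List
# Assumption (not stated in the original): PagedPDFSplitter is the PDF loader from
# early LangChain releases, later renamed PyPDFLoader
from langchain.document_loaders import PagedPDFSplitter
# Financial reports of Amazon; these can be replaced by the URLs of any PDFs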
urls = ['https://s2.q4cdn.com/299287126/files/doc_financials/Q1_2018_-_8-K_Press_Release_FILED.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q2_2018_Earnings_Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q318-Amazon-Earnings-Press-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/AMAZON.COM-ANNOUNCES-FOURTH-QUARTER-SALES-UP-20-TO-$72.4-BILLION.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q119_Amazon_Earnings_Press_Release_FINAL.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q2-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q3-2019-Amazon-Financial-Results.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q4-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q4/Amazon-Q4-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q1/Amazon-Q1-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q2/AMZN-Q2-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q3/Q3-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q4/business_and_financial_update.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q1/Q1-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q2/Q2-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Q3-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/Q4-2022-Amazon-Earnings-Release.pdf' ]
def load_reports(urls: List[str]) -> List[str]:
    """ Load pages from a list of urls"""
    pages = []
    for url in tqdm.tqdm(urls):
        # Download the PDF and save it under its original file name
        r = requests.get(url)
        path = url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(r.content)
        # Split the saved PDF into pages and collect them
        loader = PagedPDFSplitter(path)
        local_pages = loader.load_and_split()
        pages.extend(local_pages)
    return pages

pages = load_reports(urls)
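The loader derives each local file name with url.split('/')[-1], which works for the URLs listed here but would keep a query string or fragment in the name if a URL had one. A minimal sketch of a more defensive variant, using only the standard library; the helper name filename_from_url and the fallback name download.pdf are hypothetical:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.pdf") -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    # urlsplit separates the path from any "?query" or "#fragment" part.
    path = urlsplit(url).path
    # Decode percent-escapes, then take the final path component.
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the URL ends in "/" or has no path.
    return name or default
```

Inside load_reports, this could stand in for the split('/') line, for example path = filename_from_url(url).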