HDF5 (Hierarchical Data Format 5) vs Hub. Creating performant Computer Vision datasets

    The HDF5 file format is one of the most popular dataset formats out there. However, it is not optimized for deep learning tasks. In this article, Margaux contrasts the performance of the Hub format with HDF5 and explores why it is better to use Hub for CV tasks.
    • Margaux Masson-Forsythe
    10 min read · Sep 28, 2021 · Updated May 31, 2023
    How to efficiently create a Computer Vision dataset for deep learning tasks?

    Data is the key component of any Deep Learning project. Getting it right at the beginning by using the correct tools is of paramount importance. However, Computer Vision datasets are hard to manage and often require hardware with a lot of memory.

    This article is a simple tutorial on how to create a dataset for training a Computer Vision (CV) model. Computer vision datasets can get really heavy really fast, and dealing with such data is often complicated and limited by memory capacity, as previously mentioned. Therefore, in this article, we will study how to create a dataset with the popular open-source Hierarchical Data Format version 5 (HDF5) file format, and then compare this to creating the same dataset with Hub.

    What is HDF5?

    The Hierarchical Data Format version 5 (HDF5) is an open-source file format that enables working with large, complex, and heterogeneous data. One of the most popular dataset formats, HDF5 uses a structure similar to a file directory, where data can be organized within the file in various structured ways (just as with files stored on your computer). In short, an HDF5 file can be described as a complete file system (folders, subfolders, and files) packed into a single file.
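
    To make the folder analogy concrete, here is a minimal sketch (the file name and contents are hypothetical, not part of this tutorial's dataset) showing how h5py nests groups like folders and stores arrays as datasets inside them:

    import h5py
    import numpy as np

    # Groups act like folders, datasets like files holding arrays
    with h5py.File('example.h5', 'w') as f:
        grp = f.create_group('animals/cats')
        grp.create_dataset('image_0', data=np.zeros((256, 256, 3), dtype=np.uint8))

    # Items are addressed by path, just like files on disk
    with h5py.File('example.h5', 'r') as f:
        print(f['animals/cats/image_0'].shape)  # (256, 256, 3)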

    🦒 🐕 🐈 Setting up the Animals-10 Kaggle dataset 🦒 🐕 🐈

    For this tutorial, we will use the Animals-10 Kaggle dataset that contains around 28K animal pictures of 10 different categories (dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, elephant) taken from Google Images.

    To download and unzip this dataset, we use this command:

    !export KAGGLE_USERNAME="xxxxxxx" && export KAGGLE_KEY="xxxxxx" && kaggle datasets download -d alessiocorrado99/animals10 && unzip -n animals10.zip && rm -r animals10.zip
    

    Let’s explore this dataset a bit…

    import glob 
    import os
    
    dataset_animals = './raw-img'
    file_names = glob.glob(dataset_animals + '/**/*.jpeg')
    
    print(f'There are {len(file_names)} images in the dataset')
    
    class_names = os.listdir(dataset_animals)
    print(f'This dataset has {len(class_names)} classes: {class_names}')
    

    ➡️ There are 24209 images in the dataset
    This dataset has 10 classes: [‘mucca’, ‘ragno’, ‘cane’, ‘pecora’, ‘gallina’, ‘scoiattolo’, ‘farfalla’, ‘gatto’, ‘cavallo’, ‘elefante’]
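
    The class folder names are in Italian. Purely for readability, here is a small mapping to English (a convenience sketch, not needed for the rest of the tutorial):

    # Italian folder names of Animals-10 mapped to English class names
    translate = {
        'cane': 'dog', 'cavallo': 'horse', 'elefante': 'elephant',
        'farfalla': 'butterfly', 'gallina': 'chicken', 'gatto': 'cat',
        'mucca': 'cow', 'pecora': 'sheep', 'ragno': 'spider', 'scoiattolo': 'squirrel',
    }
    print([translate[c] for c in class_names])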

    import random
    import matplotlib.pyplot as plt
    import cv2

    # List of 4 random image indexes to visualize in the dataset
    list_random_indexes = [random.randint(0, len(file_names) - 1) for i in range(4)]

    for i in list_random_indexes:
        # cv2 reads the images in BGR so we use cvtColor to convert them to RGB
        train_img = cv2.cvtColor(cv2.imread(file_names[i]), cv2.COLOR_BGR2RGB)
        print(f'Shape image: {train_img.shape}')
        label_text = os.path.basename(os.path.dirname(file_names[i]))
        print(f'Label image: {label_text}')

        plt.imshow(train_img)
        plt.show()
    

    which returns these images:

    Four random images in the dataset — Image by author

    1) Creating a Computer Vision dataset with Hierarchical Data Format version 5 (HDF5)

    In the previous section, we reviewed what HDF5 is. An additional piece of information you need to know about HDF5 is that it is organized in groups and datasets:

    Hierarchical Data Format (HDF5) Dataset — Image by author

    Let’s create the Animals-10 dataset in the HDF5 format!

    First, we need a timing decorator so that we can compare the execution times for creating the datasets in the Hub and HDF5 formats:

    from functools import wraps
    from time import time

    def timeit(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time()
            result = func(*args, **kwargs)
            end = time()
            print(f'{func.__name__} executed in {end - start:.4f} seconds')
            return result, end - start

        return wrapper
    

    Using the Python package h5py, we create the HDF5 file ./datasets/animals-10-hdf5 by going through all the images in the dataset, reading each one, and adding it to the HDF5 file as its own dataset, named after the image's index in the list file_names. We time the execution using the timeit decorator defined previously:

    import h5py
    from tqdm import tqdm

    dataset_h5_path = "./datasets/animals-10-hdf5"
    os.makedirs("./datasets", exist_ok=True)  # make sure the output folder exists

    @timeit
    def create_hdf5_dataset(file_names):
        with h5py.File(f'./{dataset_h5_path}.h5', 'w') as hf:
            for i in tqdm(range(len(file_names))):
                # cv2 reads the images in BGR so we convert them to RGB
                image = cv2.cvtColor(cv2.imread(file_names[i]), cv2.COLOR_BGR2RGB)
                hf.create_dataset(str(i), data=image)

    create_hdf5_dataset(file_names)
    

    Time taken to create an HDF5 dataset

    This process took 47.1463 seconds.

    As specified previously, this HDF5 file is organized as one dataset per image, and the datasets are named after the images' indexes in the list. So we did not save the labels corresponding to these images.

    Another way to do this, which keeps the label information, is to create two datasets: one called images and one called labels. We then populate them with one NumPy array gathering all the image arrays and another NumPy array gathering all the labels. For this to work, all the images need to be the same size, so we also use the cv2 resize function and resize the images to an arbitrary size of our choosing: (256, 256).

    So first, we create the lists of all the images and labels, and then, we create the computer vision dataset called ./datasets/animals-10-img-labels-hdf5 with these two lists as datasets:

    import numpy as np

    dataset_h5_images_labels_path = "./datasets/animals-10-img-labels-hdf5"

    @timeit
    def hdf5_dataset():
        # Gather all images and labels in lists, to be stored in two separate datasets
        images = []
        labels = []

        for i in tqdm(range(len(file_names))):
            image = cv2.cvtColor(cv2.imread(file_names[i]), cv2.COLOR_BGR2RGB)
            # images need to be resized so they can be stacked into the "images" dataset
            image = cv2.resize(image, (256, 256))
            images.append(image)
            label_text = os.path.basename(os.path.dirname(file_names[i]))
            label_num = class_names.index(label_text)
            labels.append(label_num)

        # Create the hdf5 dataset
        with h5py.File(f'./{dataset_h5_images_labels_path}.h5', 'w') as hf:
            hf.create_dataset("images", np.shape(images), data=images)
            hf.create_dataset("labels", np.shape(labels), data=labels)

    hdf5_dataset()
    

    ➡️ This execution time is a bit higher than before but now we have the images and labels organized in two datasets.
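
    As a quick sanity check, the two datasets can be read back by name (a minimal sketch; the shapes in the comments are what we expect given the 24209 resized images):

    with h5py.File(f'./{dataset_h5_images_labels_path}.h5', 'r') as hf:
        print(hf['images'].shape)  # e.g. (24209, 256, 256, 3)
        print(hf['labels'].shape)  # e.g. (24209,)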

    2) An HDF5 alternative: Creating a Computer Vision dataset with Hub

    The Hub format is an HDF5 alternative designed for deep learning applications. Now we will create a Hub dataset using the same Animals-10 Kaggle dataset.

    To install hub, we use this command (and restart the runtime afterward):

    !pip install hub==2.0.9
    

    Then, we can create our local Hub Animals-10 dataset called ./datasets/animals-10-hub and again, we time the execution:

    import hub 
    
    dataset_hub_path = "./datasets/animals-10-hub"
    
    @timeit
    def create_hub_dataset(dataset_path, file_names):
      with hub.empty(dataset_path, overwrite=True) as ds:
          # Create the tensors with names of your choice.
          ds.create_tensor('images')
          ds.create_tensor('labels')
    
          # Iterate through the files and append to hub dataset
          for file in tqdm(file_names):
              label_text = os.path.basename(os.path.dirname(file))
              label_num = class_names.index(label_text)
    
              ds.images.append(cv2.cvtColor(cv2.imread(file), cv2.COLOR_BGR2RGB))  # Append to images tensor using cv2 imread
              ds.labels.append(np.uint32(label_num)) # Append to labels tensor
    
    create_hub_dataset(dataset_hub_path, file_names)
    

    For this dataset, we can easily create two tensors, images and labels, and populate them by going through file_names: each image is read with cv2 and appended to the images tensor, while each label name's index in the list class_names is appended to the labels tensor.

    ➡️ This execution time is a bit higher than for the “one dataset per image” HDF5 dataset, which took 47.1463 seconds, and is similar to the creation time of the HDF5 dataset organized in the two datasets images and labels.

    Comparing incremental image loading from the datasets on local disk

    Now we want to know how much time it takes to incrementally load 100 images, sequentially and randomly, from the datasets we just created on local disk. We save these times in lists called hdf_time_seq_access, hdf_time_seq_access_img_label, and hub_time_seq_access for the sequential access, and in lists called hdf_time_random_access, hdf_time_random_access_img_label, and hub_time_random_access for the random access. Then we can easily visualize and compare them against each other:

    Sequential Access

    • HDF5 format one dataset per image

        n_images_to_load = 100
      
        @timeit
        def hdf_load(k, hf):
            print([hf[str(i)] for i in range(k)])
      
        hdf_time_seq_access = []
        hf = h5py.File(f'./{dataset_h5_path}.h5', "r") 
      
        # Sequential access 0 to 100 indexes
        for k in tqdm(range(n_images_to_load)):
          hdf_time_seq_access.append(hdf_load(k, hf)[1])
      
    • HDF5 format only two datasets: images and labels

        @timeit
        def hdf_load(k, hf):
            print([hf['images'][i] for i in range(k)])
            print([hf['labels'][i] for i in range(k)])
      
        hdf_time_seq_access_img_label = []
        hf = h5py.File(f'./{dataset_h5_images_labels_path}.h5', "r") 
      
        # Sequential access 0 to 100 indexes
        for k in tqdm(range(n_images_to_load)):
          hdf_time_seq_access_img_label.append(hdf_load(k, hf)[1])
      
    • Hub format

        @timeit
        def hub_load(k, ds):
            print([ds.images[i] for i in range(k)])
            print([ds.labels[i] for i in range(k)])

        hub_time_seq_access = []
        ds = hub.load(dataset_hub_path)

        # Sequential access 0 to 100 indexes
        for k in tqdm(range(n_images_to_load)):
            hub_time_seq_access.append(hub_load(k, ds)[1])
      

    Let’s visualize those times in a graph:

    plt.plot(range(n_images_to_load), hub_time_seq_access, color='orange')
    plt.plot(range(n_images_to_load), hdf_time_seq_access, color='blue')
    plt.plot(range(n_images_to_load), hdf_time_seq_access_img_label, color='magenta')
    plt.legend(['Hub with datasets images/labels', 'HDF', 'HDF with datasets images/labels'])

    plt.xlabel('Number of reads')
    plt.ylabel('Read time (s)')
    plt.title("Hub versus HDF vs HDF organized in two datasets on local disk -- sequential access")
    

    Graph of Hub versus HDF and HDF with datasets images/labels on local disk, sequential access— Image by author

    We see here that the incremental time taken to load images from the HDF5 dataset organized in the images/labels datasets grows much faster than for the other two.

    We remove this line from the plot so that we can see the comparison between Hub and HDF5 better:

    Graph of Hub with images and labels versus HDF on local disk, sequential access — Image by author

    As you can see, the Hub and HDF5 formats have pretty similar timings when incrementally reading images from the local disk sequentially.

    Random Access case for Hub vs HDF5

    Now let’s take a look at the random access case: instead of reading the images and labels at indexes 0 to 100, we read 100 images whose indexes range from 0 to 1000 with a step of 10 between each index. For this, we only need to replace the loop in the previous functions:

    for k in tqdm(range(n_images_to_load)):
    

    is replaced by:

    for k in tqdm(range(0, 1000, 10)):
    

    We also use different names for the lists as specified previously: hdf_time_random_access, hdf_time_random_access_img_label, hub_time_random_access.
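
    For example, the Hub timing loop becomes the following (a minimal sketch reusing hub_load and ds from the sequential section; the two HDF5 variants follow the same pattern with their own lists):

    hub_time_random_access = []

    # Random access: 100 timed reads, with k stepping from 0 to 990 in steps of 10
    for k in tqdm(range(0, 1000, 10)):
        hub_time_random_access.append(hub_load(k, ds)[1])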

    With the same logic, we visualize the graphs of the execution times:

    Graph of Hub versus HDF and HDF with datasets images/labels on local disk, random access — Image by author

    Graph of Hub with images and labels versus HDF on local disk, random access — Image by author

    Once again, we see that the execution times for the HDF5 dataset organized in the images/labels datasets keep increasing much more than when we load the images from the other datasets. We also observe again that Hub and HDF5 have pretty similar timings when incrementally reading images from the local disk randomly.

    However, with the Hub dataset, we are actually also reading the label of the image, which is not the case with this HDF5 dataset.

    Being able to save and load labels and images quickly is an essential feature for a deep learning dataset tool. Once we have our Hub dataset containing our images and labels, we can use it directly for training, for example for classification tasks, with either PyTorch or TensorFlow, without writing any custom dataloader:

    ds = hub.dataset('./dataset_path') # Hub Dataset

    # PyTorch dataloader
    dataloader = ds.pytorch(batch_size=16, num_workers=2)

    for data in dataloader:
        print(data)
        # Training Loop

    # OR

    # TensorFlow dataset (ds.tensorflow() returns a tf.data.Dataset; batching is done with the tf.data API)
    dataloader = ds.tensorflow().batch(16)

    for data in dataloader:
        print(data)
        # Training Loop
    

    This helps save even more time!

    The HDF5 file format is one of the most popular and reliable formats. However, it is not optimized for deep learning tasks. Hub, on the contrary, is an HDF5 alternative that provides a deep-learning-native dataset format. It makes it easy to apply transformations to data in parallel, integrate with machine learning frameworks like PyTorch or TensorFlow, or get started with machine learning without having to download entire huge datasets like Google Objectron. All in all, creating computer vision datasets with the right format can make all the difference for a successful deep learning project.


    Now, try running the code for yourself via the Notebook for this tutorial.
