Hot Dog Not Hot Dog - a tasty tale of best practices for classification models for computer vision

Have you ever tried using not-so-popular datasets such as CelebA or ag_news through services such as TensorFlow Datasets (tfds) or torchvision.datasets? I have, and most of the time you run into several errors before finally resorting to traditional methods such as zip and tar archives. The following is a story of how to deal with this and other problems that arise during machine and deep learning tasks. We'll take a look at a classic Hot Dog / Not Hot Dog example, with a twist - it will be much easier than the process you might've grown accustomed to at the image pre-processing stage of your computer vision tasks. Sounds delicious? I know! Try not to think of hot dogs or get hungry over the next couple of minutes, though.

What's Software 2.0, and why it's not really possible without Data 2.0

Andrej Karpathy famously referred to neural networks as Software 2.0, stating that the code behind many of the applications currently in use is much more abstract, such as the weights of a neural network. Software 2.0 increasingly relies on unstructured data - images, videos, text, etc. All this data is stored and utilized inefficiently in data lakes, data warehouses or object storage. This forces us, machine learning engineers, to play ketchup (pun intended) with each other, trying to find an incremental improvement to the convoluted problem of data wrangling without ever really solving it. Enter Hub by Activeloop.

Setting up (the grill)

Hub allows you to store your computer vision datasets as cloud-native multidimensional arrays, so you can seamlessly access and work with them from any machine. You can even version control datasets, much like you would with git. Each version stores only the differences from the previous one rather than a full copy. This new paradigm of working with data is called Data 2.0. You could also think of it as a system where repositories are datasets and commits are made up of additions and edits of the labels.

Using Activeloop Hub, you can work with public or your own private data, locally or on any cloud. In this tutorial, we'll upload the Hot-Dog-Not-Hot-Dog dataset to the Activeloop Hub's platform, as well as visualize it within the web app. Ordinarily, loading a public dataset means writing dozens of lines of code and spending hours accessing and understanding the API as well as downloading the data. With Hub, you only need two lines of code, and you can get started working on your dataset in a couple of seconds. First things first, we install the Python package using pip:

pip install hub

To upload your own data, you'll need to register and authenticate into Hub. You can register for an account at this link. If you're planning to follow this tutorial with my dataset, there's no need to register!

Getting Started 🚀

You can access popular computer vision datasets in Hub by following a straightforward convention. For example, to get the first 1000 images of the famous Google Objectron dataset that we released a while ago, we can run the following snippet:

from hub import Dataset

objectron = Dataset("google/bike")

In this blogpost, however, we'll create our own Dataset and then upload it to the Activeloop platform. Let's get started.

Building the Barbe-cute Schema

Schemas are an essential part of the Activeloop Hub ecosystem. They define the structure, shapes, data types, meta information (image channels, class names, etc.) and special serialization/deserialization methods. Currently there are about a dozen available Hub schemas such as Image, Mask, Segmentation and Text. For the complete list, please visit this link.

As we're trying to build a Binary Image Classifier, our dataset schema contains only two components, namely:

  • image
  • label

We define a Hub schema as a standard Python dict. We use the ClassLabel and Image from hub.schema and specify some parameters such as shape, max_shape, dtype and num_classes.
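To make the role of these parameters more concrete, here is a minimal, library-free sketch. It does not use hub.schema; `schema_spec`, `shape_fits` and `sample_is_valid` are hypothetical names, and the numbers are illustrative. It assumes that a `None` entry in `shape` marks a dimension that may vary per sample, with `max_shape` as its upper bound.

```python
from typing import Optional, Sequence

# Plain-dict stand-in for a two-field schema (NOT Hub's Image/ClassLabel classes).
# The concrete numbers are illustrative assumptions.
schema_spec = {
    "image": {"shape": (None, None, 3), "max_shape": (1024, 1024, 3), "dtype": "uint8"},
    "label": {"num_classes": 2},
}


def shape_fits(
    actual: Sequence[int],
    shape: Sequence[Optional[int]],
    max_shape: Sequence[int],
) -> bool:
    """Check one sample's shape against a declared shape and its upper bound."""
    if len(actual) != len(shape):
        return False
    for size, fixed, limit in zip(actual, shape, max_shape):
        if fixed is not None and size != fixed:
            return False  # a fixed dimension must match exactly
        if size > limit:
            return False  # a variable dimension may not exceed max_shape
    return True


def sample_is_valid(image_shape: Sequence[int], label: int, spec: dict = schema_spec) -> bool:
    """A sample is valid if its image shape fits and its label is a known class."""
    image_spec = spec["image"]
    fits = shape_fits(image_shape, image_spec["shape"], image_spec["max_shape"])
    return fits and 0 <= label < spec["label"]["num_classes"]
```

With this sketch, a 480 x 640 RGB image labelled 1 passes, while a four-channel image, an oversized image, or a label of 2 is rejected.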

Next up we'll create an instance of a Hub Dataset by specifying the:

  • name/path: Either the file path to the dataset or the tag that will be used to identify the dataset on Activeloop platform, such as sauravmaheshkar/hot-dog-not-hot-dog-train. (A tag could point to either a remote or local path.)
  • mode: Reading (r) & writing (w) mode
  • shape: The Shape of the Dataset (follows numpy shape convention, such as (4,))
  • schema: The Schema for the dataset defined as a dict

Mustard or mayo? Applying data transformations 👎🏻 → 👍

Hub Transform lets you modify the samples of a dataset or create a new dataset from an existing one. To apply these modifications, add the @hub.transform decorator to any custom function. The user-defined transform function is applied to every sample in the input. It takes in an iterator or a Hub dataset and outputs another dataset with the specified schema. Numerous optimizations (such as chunking) are done behind the scenes to process and store the dataset efficiently.

In this transform we associate a label 0 if the image is in the not_hot_dog folder and 1 if it is in the hot_dog folder.
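The labelling rule itself is easy to sketch without Hub. The snippet below is only an illustration of the idea: `per_sample` is a hypothetical stand-in for a per-sample decorator (it is not Hub's implementation and does no chunking or storage), `label_from_folder` is a made-up name, and the `train/<folder>/<file>` layout is an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator


def per_sample(fn: Callable[[str], Dict]) -> Callable[[Iterable[str]], Iterator[Dict]]:
    """Hypothetical stand-in decorator: apply `fn` to every item of the input."""
    def run(inputs: Iterable[str]) -> Iterator[Dict]:
        for item in inputs:
            yield fn(item)
    return run


@per_sample
def label_from_folder(path: str) -> Dict:
    """Derive the class label from the name of the image's parent folder."""
    folder = Path(path).parent.name
    if folder not in ("hot_dog", "not_hot_dog"):
        raise ValueError(f"unexpected folder name: {folder!r}")
    # 1 for hot_dog, 0 for not_hot_dog, as described in the text.
    return {"path": path, "label": 1 if folder == "hot_dog" else 0}


# Assumed directory layout, for illustration only.
files = ["train/hot_dog/1.jpg", "train/not_hot_dog/2.jpg"]
samples = list(label_from_folder(files))
```

A real transform would also read the image bytes for each path; that step is left out here to keep the sketch free of image libraries.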

Upload your computer vision dataset to Activeloop Hub ⬆️

Now that we've created a hub.Dataset instance and created our transform function (fill_ds), we'll just pass in the file names and then upload our dataset to the platform.

The uploaded datasets will be available in the Activeloop’s visualization app for free. Here's the link to the datasets:

So, we meat again, advanced transformations

Let's look at what else we can do with @hub.transform(). An essential part of any computer vision data pipeline is image pre-processing. In this tutorial, we're going to use a convolutional neural network called ResNet-18 to train a Binary Image Classifier. All of torchvision's pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].

Thus, we'll create another dataset, this time with an image size of (224, 224, 3), and then normalize the images using torchvision.transforms.
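To see what that normalization amounts to numerically, here is a small NumPy sketch. It roughly mirrors the effect of scaling to [0, 1] and then standardizing each channel, and it assumes the image has already been resized to (224, 224, 3). The function name `normalize_image` is made up for this example, and this is not the torchvision code path used in the tutorial.

```python
import numpy as np

# Per-channel statistics quoted in the text above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(image_hwc: np.ndarray) -> np.ndarray:
    """Turn a uint8 (H, W, 3) image into a normalized float32 (3, H, W) array."""
    scaled = image_hwc.astype(np.float32) / 255.0            # values now in [0, 1]
    standardized = (scaled - IMAGENET_MEAN) / IMAGENET_STD   # broadcast over the last axis
    return np.transpose(standardized, (2, 0, 1))             # HWC -> CHW


# An all-white image stands in for an already resized photo.
image = np.full((224, 224, 3), 255, dtype=np.uint8)
tensor = normalize_image(image)
```

The output has the channel-first layout (3 x H x W) that the pre-trained models expect.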

Link to the Resized Dataset on the Activeloop's platform for dataset visualization.

The Model 👷‍♀️

Transfer Learning

The main aim of transfer learning (TL) is to implement a model quickly, i.e. instead of training a DNN (deep neural network) from scratch, the model reuses the features it has learned on a different dataset for a similar task. This process is also known as knowledge transfer.


ResNet was one of the most innovative deep learning models in the computer vision/deep learning community in the last few years. A residual network, or ResNet for short, is a DNN that helps to build deeper neural networks by utilizing skip connections or shortcuts to jump over some layers. This helps solve the problem of vanishing gradients.
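The skip connection can be illustrated in a few lines of plain Python. This is a toy sketch on lists of floats, not ResNet's actual convolutional block, and `residual_block` is a name invented for the example. The block outputs relu(x + F(x)), so when the learned branch F contributes nothing, a non-negative input passes through unchanged.

```python
from typing import Callable, List

Vector = List[float]


def relu(values: Vector) -> Vector:
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Add the input back onto the branch output (the skip connection), then apply ReLU."""
    fx = branch(x)
    return relu([a + b for a, b in zip(x, fx)])


def zero_branch(values: Vector) -> Vector:
    """A branch that has learned nothing yet: it outputs zeros."""
    return [0.0 for _ in values]


def halving_branch(values: Vector) -> Vector:
    """An arbitrary example branch, used only for illustration."""
    return [0.5 * v for v in values]


x = [1.0, 2.0, 3.0]
identity_out = residual_block(x, zero_branch)     # the input passes straight through
adjusted_out = residual_block(x, halving_branch)  # the input plus a learned correction
```

Because the output is x + F(x), its derivative with respect to x contains a constant identity term alongside the branch's own derivative, which is the usual intuition for why gradients survive through many stacked blocks.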

There are different versions of ResNet, including ResNet-18, ResNet-34, ResNet-50, and so on. The numbers denote the number of layers, while the overall design stays the same. ResNet-18 is thus 18 layers deep.

In the end, we just add an Adaptive Pooling Layer and a Fully Connected Layer with output dimensions equal to the number of classes.

Let's cook now (train your computer vision model)

Now that we have applied the necessary data transformations (resized and normalized our images) and created a model, the next step is to train the model. We'll fetch the resized dataset and use the .to_pytorch() function to convert the dataset into PyTorch compatible format. We'll create a DataLoader instance from the converted dataset and simply train our model.
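A bare-bones version of such a training loop is sketched below under several assumptions. `FakeHotDogs` is a synthetic stand-in for the converted dataset, and it assumes samples arrive as dicts keyed by the schema field names "image" and "label". The tiny model is a placeholder for the ResNet-based classifier so the snippet stays fast, and the optimizer and learning rate are arbitrary choices.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class FakeHotDogs(Dataset):
    """Synthetic stand-in for the converted dataset; yields dict samples."""

    def __init__(self, n: int = 8) -> None:
        generator = torch.Generator().manual_seed(0)
        self.images = torch.randn(n, 3, 224, 224, generator=generator)
        self.labels = torch.randint(0, 2, (n,), generator=generator)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, index: int) -> dict:
        return {"image": self.images[index], "label": self.labels[index]}


# Tiny placeholder model (assumption); swap in the ResNet-based classifier.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2))

loader = DataLoader(FakeHotDogs(), batch_size=4, shuffle=True)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
losses = []
for batch in loader:  # one pass over the data
    optimizer.zero_grad()
    loss = criterion(model(batch["image"]), batch["label"])
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

For a real run you would loop over several epochs and track accuracy on a held-out split as well.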

To find out more about Hub and how to use it in your own projects, visit the Activeloop Hub GitHub repository. For more advanced data pipelines like uploading large datasets or applying many transformations, please refer to the documentation. If you want to condiment...compliment my puns or have more questions regarding using Hub, join our community Slack channel!
