    A Simpson's quick start guide to any Machine Learning image classification project with organized trackable datasets

    Getting data ready to train a machine learning model may make you say "¡Ay, caramba!" at times, just like Bart Simpson. Unless you're using Activeloop Hub, of course. Read a Springfield-inspired multiclass classification tutorial to see for yourself.
Margaux Masson-Forsythe
11 min read · May 26, 2021 · Updated May 31, 2023
Getting data ready to train a Machine Learning (ML) model is usually a very time-consuming task and can end up representing half of the time spent on a Machine Learning project. Starting quickly and efficiently is crucial for most projects. This article will help you start your multiclass classification projects in a second!

We all know that real-world data is messy. However, without a clean, organized, and easily accessible dataset, a Machine Learning project will never lead to good results. You can tweak every hyperparameter a hundred times, but without a good dataset it is a total waste of time and energy.

Therefore, starting any ML project with an organized structure and the right tools is essential. In this article, we will use an easy and efficient way to start a multiclass classification project using the amazing new features of Activeloop Hub - a dataset management tool for deep learning applications (with a focus on computer vision).

    Automatic creation of the dataset with hub.auto

For this example, we will use the Kaggle Simpsons Characters Dataset (if you, too, started re-watching ALL the episodes from the beginning when the pandemic started and are now only halfway through the seasons, you will have a lot of fun with this project). This Kaggle dataset gathers JPG images of every character, taken directly from TV show episodes and labeled.


It can easily be downloaded with this command:

export KAGGLE_USERNAME="xxxx" && export KAGGLE_KEY="xxx" && mkdir -p data && cd data && kaggle datasets download -d alexattia/the-simpsons-characters-dataset && unzip -n the-simpsons-characters-dataset.zip

    Now, let’s take a look at the structure of the directory:

[Image: directory listing of the simpsons_dataset folder, one subfolder per character]

We can see that each character has their own subfolder, named after them. Lisa would be so happy.
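If you want to double-check that layout programmatically, a minimal sketch like the one below (assuming the archive was extracted to the same dataset_path used later in this tutorial) prints a few of the per-character subfolders:

import os

dataset_path = './data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset'
# Each subfolder name doubles as the class label, e.g. 'lisa_simpson', 'homer_simpson', ...
print(sorted(os.listdir(dataset_path))[:5])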

Once the dataset is downloaded, we use the Hub feature called Auto Create, which parses the image classification dataset. First, we need to install Hub, if that's not already done:

pip install hub==1.3.5

    Then:

from hub import Dataset

dataset_path = './data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset'

ds = Dataset.from_path(dataset_path)

    NB: the variable dataset_path is the path to the dataset’s directory that contains all the images organized in subfolders corresponding to their respective classes — for example, all Lisa Simpson images are in the subfolder “./data/the-simpsons-characters-dataset/simpsons_dataset/lisa_simpson”. The image classification directory needs to be organized like this for the Hub Auto Create feature to be able to work correctly.

    We can then take a look at the dataset ds:

print(ds.shape)

    returns: (20933,). So we know there are 20933 images in the dataset.

print(ds.schema)

    returns:

SchemaDict({'image': Image(shape=(None, None, None), dtype='uint8', max_shape=(1072, 1912, 3)), 'label': ClassLabel(shape=(), dtype='uint16', names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'], num_classes=42)})

    We see here that there are 42 classes in the dataset ds.

    Let’s visualize 6 random images from the dataset:

import random
import matplotlib.pyplot as plt

def show_image_in_ds(ds, idx=1):
    image = ds['image', idx].compute()
    label = ds['label', idx].compute(label_name=True)
    print("Image:")
    plt.imshow(image)
    plt.show()
    print("Label: \"%s\"" % (label))

num_images_to_display = 6
for _ in range(num_images_to_display):
    show_image_in_ds(ds, random.randint(0, ds.shape[0] - 1))

[Image: six random images from the dataset with their labels]

As we can see here, the images have different sizes and need to be resized to a common size for training. For this, we can use the Hub transform feature:

import hub
from hub import schema
from skimage.transform import resize
from skimage import img_as_ubyte

# resize images
new_shape = (256, 256, 3)
new_schema = {
    "image": schema.Image(shape=new_shape, dtype="uint8"),
    "label": schema.ClassLabel(names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'])
}

@hub.transform(schema=new_schema)
def resize_transform(index):
    image = resize(ds['image', index].compute(), new_shape, anti_aliasing=True)
    image = img_as_ubyte(image)  # recast from float to uint8
    label = int(ds['label', index].compute())
    return {
        "image": image,
        "label": label
    }

ds_r = resize_transform(range(ds.shape[0]))

    Now we want to store the resized dataset in Hub:

url = "margauxmforsythe/simpsons_resized_256x256"
# This will take some time as there are 20k images in the dataset
ds_r.store(url)

Then the dataset is available and can be visualized in Activeloop's visualization app:

[Image: dataset preview in Activeloop's visualization app]

or it can be loaded using the URL:

ds_from_hub = Dataset(url)

# Visualize the images and labels
def show_image_in_ds(ds, idx=1):
    image = ds['image', idx].compute()
    label = ds['label', idx].compute(label_name=True)
    print("Image:")
    print(image.shape)
    plt.imshow(image)
    plt.show()
    print("Label: \"%s\"" % (label))

for i in range(6):
    show_image_in_ds(ds_from_hub, i)


    Dataset Filtering / Variants of the same dataset

    Using the filter feature of Hub, we can easily create subsets of the dataset or get rid of elements not needed in the training.

    Create a subset dataset with only some selected characters

    Maggie

For example, if we want to create a subset of Maggie's images, we filter the dataset and keep only the items whose labels start with "maggie":

# Creates a DatasetView object for a subset of the Dataset.
ds_only_maggie = ds_from_hub.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie"))

So here, the filter takes in each item of the dataset and returns True or False depending on whether its label starts with "maggie". The function is applied to all the items, and only those that return True are retained in the DatasetView, that is to say, Maggie's images.

    We can check if the number of images we now have in the subset is correct:

from glob import glob

number_maggie_images_in_subset = len(ds_only_maggie)
path_to_maggie_images = './data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset/maggie_simpson'
number_maggie_imgs = len(glob(f"{path_to_maggie_images}/*.jpg"))
assert number_maggie_images_in_subset == number_maggie_imgs
print(number_maggie_images_in_subset)

which returns: 128. So we know we have 128 images of Maggie in the subset ds_only_maggie.

    With the same logic, we can create a subset without Maggie:

ds_without_maggie = ds.filter(lambda x: not x["label"].compute(label_name=True).startswith("maggie"))
print(ds.shape[0] - number_maggie_images_in_subset == len(ds_without_maggie))  # shape is (20805,)

    which returns True, so we know that all 128 images of Maggie were removed.

    A Simpsons’ Family Photo (Dataset)

    Now we want to create a subset of the Simpsons family only: Maggie, Marge, Lisa, Bart, and Homer:

    1# Creates a DatasetView object for a subset of the Dataset.
    2ds_simpsons_family = ds_from_hub.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie")
    3or x["label"].compute(label_name=True).startswith("marge")
    4or x["label"].compute(label_name=True).startswith("lisa")
    5or x["label"].compute(label_name=True).startswith("bart")
    6or x["label"].compute(label_name=True).startswith("homer"))
    7
    8print(len(ds_simpsons_family)) #returns 6361
    9

    There are 6361 images of the members of the Simpsons family.

    Monitor your datasets without “D’oh!”-s

    “Mom, look, I found something more fun than complaining!”
    — Lisa Simpson

Datasets are, as we said before, the most important part of any training. So why not treat them the way we treat scripts? When a training script is modified, we often want to know what changes were made, so that, if something breaks, we can roll back to the previous version of the script - and this is usually done with git.

    So, why not do the exact same thing with the datasets? They are even more important than the training script!

Well, that's exactly what Hub version control does. Here is an example with the different versions of the dataset (subsets) we created previously:

Create a new commit "first commit" in the master branch:

ds = Dataset(url)
ds.checkout("master")
a = ds.commit("first commit")

    Create a new branch called “subsets”:

ds.checkout("subsets", create=True)  # creates a new branch
ds.flush()
print(ds.branches)  # returns dict_keys(['master', 'subsets'])
ds.log()

    The ds.log() returns:

Current Branch: subsets

commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:22:46
Message: "first commit"

This shows that we are on the "subsets" branch and that there was one commit, "first commit", made on the master branch.

    Create a commit with only Maggie’s images in the “subsets” branch:

ds.checkout("subsets")  # checkout to the subsets branch

# Filter the dataset and only keep Maggie's images
dt = ds.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie"))
dt.commit("Maggie images subset")
ds.log()

Now the log shows that we are still on the "subsets" branch, and that another commit, "Maggie images subset", has been added to it:

Current Branch: subsets

commit 1b54aa2185d3f61167737a860f7205e15aeef7b6 (subsets)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:25:04
Message: "Maggie images subset"

commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:22:46
Message: "first commit"

Commit a subset with the Simpsons family:

# Filter the Simpsons family from the dataset
dt = ds.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie")
               or x["label"].compute(label_name=True).startswith("marge")
               or x["label"].compute(label_name=True).startswith("lisa")
               or x["label"].compute(label_name=True).startswith("bart")
               or x["label"].compute(label_name=True).startswith("homer"))
c = dt.commit("Simpsons family subset")
ds.log()

    And now the log shows the three commits:

Current Branch: subsets

commit 3cf078659a6499f9e6e8bf163cc6926ab2ab3d37 (subsets)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:34:31
Message: "Simpsons family subset"

commit 1b54aa2185d3f61167737a860f7205e15aeef7b6 (subsets)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:25:04
Message: "Maggie images subset"

commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
Author: margauxmforsythe
Commit Time:  2021-05-20 20:22:46
Message: "first commit"

And finally, we want to go back to the first commit on the master branch:

ds.checkout(a)  # reminder: we ran a = ds.commit("first commit")

    which could also be done with this line using the commit id shown in the log:

ds.checkout('7d8d6c7f891139dba5c13ea57360b854ac6990d6')  # commit id from the log

    So now, we have two branches and three commits for the dataset corresponding to the url “margauxmforsythe/simpsons_resized_256x256”.
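As a quick sanity check, the same calls used above can confirm this from wherever we are currently checked out:

print(ds.branches)  # expected: dict_keys(['master', 'subsets'])
ds.log()            # prints the commit history visible from the current checkout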

    Saving the Simpsons family subset as a separate Dataset

    Now if we want to use the subset with only the images of the Simpsons family, we can save the subset we created previously and use it for training — but keep the information that there are 42 classes in the original dataset so that we can train with more characters later:

ds_S = ds_simpsons_family.store('margauxmforsythe/simpsons_family')
ds_S

    which returns:

Dataset(schema=SchemaDict({'image': Image(shape=(256, 256, 3), dtype='uint8'), 'label': ClassLabel(shape=(), dtype='uint8', names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'], num_classes=42)}), url='margauxmforsythe/simpsons_family', shape=(6361,), mode='w')

    And if we check in the web app, we see there are 6361 images.


The notebook for the dataset manipulation with Hub features is here.

Now you can manipulate datasets more easily than ever before and start a simple training. Let's try it!

For the training, we will only use the Simpsons family subset and a simple CNN. The first step is to get the dataset ready for training. We will use TensorFlow, and so will use the Hub to_tensorflow feature:

def to_model_fit(item):
    x = item["image"] / 255  # normalize
    y = item["label"]
    return (x, y)

image_count = len(ds_S)
print(f"Images count: {image_count}")  # Images count: 6361

ds_tf = ds_S.to_tensorflow(include_shapes=True)
ds_tf = ds_tf.map(lambda x: to_model_fit(x))

Then we need to shuffle the dataset and split it into a train set and a validation set, with 80% of the images used for training and 20% for validation:

train_size = int(0.8 * image_count)
val_size = int(0.2 * image_count)
batch_size = 8
print(f"{train_size} training images and {val_size} validation images. Batch size of {batch_size}")

list_ds = ds_tf.shuffle(image_count)
val_ds = ds_tf.take(val_size)
train_ds = ds_tf.skip(val_size)
train_ds = train_ds.shuffle(train_size)
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.shuffle(val_size)
val_ds = val_ds.batch(batch_size)
=> 5088 training images and 1272 validation images. Batch size of 8

    Now we can define the model, compile it and run the training:

import tensorflow as tf

num_classes = 42  # keep all 42 classes from the original schema so more characters can be trained later

model = Simple_CNN_With_Dropout(num_classes)
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=60)
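Simple_CNN_With_Dropout is not defined in the post itself. As a reference point, here is a minimal Keras sketch of what such a model could look like; the exact layers and sizes are assumptions, not the author's architecture, and the final Dense layer returns raw logits to match from_logits=True above:

def Simple_CNN_With_Dropout(num_classes):
    # Hypothetical architecture: a few Conv/MaxPool blocks, dropout, and a logits head.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(256, 256, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes),  # raw logits
    ])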

    Evaluation on the test set

We first need to set up the test set in the same way we set up the previous dataset, using the hub.auto and hub.transform features. We did have to put the test images in subfolders corresponding to their classes beforehand:

[Image: test set directory structure, one subfolder per class]

test_set_path = "./data/the-simpsons-characters-dataset/kaggle_simpson_testset/test"
ds_test = Dataset.from_path(test_set_path)

# resize images
new_shape = (256, 256, 3)
new_schema = {
    "image": schema.Image(shape=new_shape, dtype="uint8"),
    "label": schema.ClassLabel(names=['bart_simpsons', 'homer_simpsons', 'lisa_simpsons', 'maggie_simpson', 'marge_simpson'])
}

@hub.transform(schema=new_schema)
def resize_transform(index):
    image = resize(ds_test['image', index].compute(), new_shape, anti_aliasing=True)
    image = img_as_ubyte(image)  # recast from float to uint8
    label = int(ds_test['label', index].compute())
    return {
        "image": image,
        "label": label
    }

ds_r = resize_transform(range(ds_test.shape[0]))
ds_test = ds_r.store("margauxmforsythe/simpsons_dataset_test")

    Finally, we ran the model on the test set:

import numpy as np

ds_test = Dataset("margauxmforsythe/simpsons_dataset_test")
ds_test_pred = ds_test.to_tensorflow(include_shapes=True).batch(1)
ds_tf = ds_test_pred.map(lambda x: to_model_fit(x))
predictions_test_ds = model.predict(ds_tf)

y_pred = []
y_true = []
i = 0

for img, label in ds_tf:
    y_true.append(classes_family[label.numpy()[0]])
    y_pred.append(classes[np.argmax(predictions_test_ds[i])])
    plt.imshow(img[0])
    plt.show()
    print(f"Predicted class: {classes[np.argmax(predictions_test_ds[i])]}, real class: {classes_family[label.numpy()[0]]}")
    i = i + 1
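One caveat: classes and classes_family are not defined in the snippets above. Presumably they map label indices back to names, with classes holding the 42 training class names (in schema order) and classes_family the names from the test schema. A hypothetical definition, copied from the schemas shown earlier in this post, would be:

# Assumption: index -> name mappings that mirror the schemas defined earlier.
classes = ['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble',
           'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum',
           'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil',
           'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown',
           'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson',
           'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak',
           'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner',
           'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier',
           'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers']
classes_family = ['bart_simpsons', 'homer_simpsons', 'lisa_simpsons', 'maggie_simpson', 'marge_simpson']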

    These are some of the results from the predictions on the test set (there was no example of Maggie in the test set):

[Image: sample test-set predictions with predicted and true labels]

    The final confusion matrix after 60 epochs:

[Image: confusion matrix on the test set after 60 epochs]
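A matrix like this one can be generated from the y_true and y_pred lists collected in the loop above, for example with scikit-learn (a sketch, not necessarily how the original figure was produced, and assuming scikit-learn is installed):

from sklearn.metrics import confusion_matrix

label_names = sorted(set(y_true) | set(y_pred))  # class names that actually appear
cm = confusion_matrix(y_true, y_pred, labels=label_names)

plt.imshow(cm, cmap="Blues")
plt.xticks(range(len(label_names)), label_names, rotation=90)
plt.yticks(range(len(label_names)), label_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.colorbar()
plt.show()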

The training notebook using the Hub datasets is here.

    Oh, so they have internet on computers now! — Homer

    If you have any questions regarding this tutorial, I’ll be at Moe’s… ehm, the Deep Lake Slack Community. Feel free to hit us up there - we might even have donuts!
