A Simpson's quick start guide to any Machine Learning image classification project with organized trackable datasets

    Getting data ready to train a machine learning model may make you say "¡Ay, caramba!" at times, just like Bart Simpson. Unless you're using Activeloop Hub, of course. Read a Springfield-inspired multiclass classification tutorial to see for yourself.
Margaux Masson-Forsythe

on May 26, 2021 · 11 min read
Getting data ready to train a Machine Learning (ML) model is usually a very time-consuming task and can end up representing half of the time spent on a project. Starting quickly and efficiently is crucial. This article will help you start your multiclass classification projects in no time!

We all know that real-world data is messy. Without a clean, organized, and easily accessible dataset, a Machine Learning project will never lead to good results. You can change every hyperparameter a hundred times, but if you don't have a good dataset, it's a total waste of time and energy.

Therefore, starting any ML project with an organized structure and the right tools is essential. In this article, we will use an easy and efficient way to start a multiclass classification project using the amazing new features of Activeloop Hub - a dataset management tool for deep learning applications (with a focus on computer vision).

    Automatic creation of the dataset with hub.auto

For this example, we will use the Kaggle Simpsons Characters Dataset (if you, too, started re-watching ALL the episodes from the beginning when the pandemic started and are now only halfway through the seasons, you will have a lot of fun with this project). This Kaggle dataset gathers JPG images of every character, taken directly from episodes of the TV show and labeled.

[Image: sample images from the Kaggle Simpsons Characters Dataset]

    It can easily be downloaded using this command line:

    export KAGGLE_USERNAME="xxxx" && export KAGGLE_KEY="xxx" && mkdir -p data && cd data && kaggle datasets download -d alexattia/the-simpsons-characters-dataset && unzip -n the-simpsons-characters-dataset.zip
    

    Now, let’s take a look at the structure of the directory:

[Image: directory structure of the downloaded dataset, with one subfolder per character]

    We can see that all characters have their own subfolder with their name. Lisa would be so happy.

Once we have the dataset downloaded, we use the Hub feature called Auto Create, which parses the image classification dataset for us. First, we need to install Hub, if not already done:

    pip install hub==1.3.5

    Then:

    from hub import Dataset

    dataset_path = './data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset'

    ds = Dataset.from_path(dataset_path)

NB: the variable dataset_path is the path to the dataset directory that contains all the images, organized in subfolders corresponding to their respective classes — for example, all Lisa Simpson images are in the subfolder "./data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset/lisa_simpson". The image classification directory needs to be organized this way for the Hub Auto Create feature to work correctly, as sketched below.
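
To make this layout concrete, here is an illustrative sketch of the expected directory tree (the file names are made up; only the one-subfolder-per-class structure matters):

    simpsons_dataset/
        abraham_grampa_simpson/
            pic_0000.jpg
            pic_0001.jpg
            ...
        bart_simpson/
            ...
        lisa_simpson/
            ...
        waylon_smithers/
            ...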

    We can then take a look at the dataset ds:

    print(ds.shape)

    returns: (20933,). So we know there are 20933 images in the dataset.

    print(ds.schema)

    returns:

    SchemaDict({'image': Image(shape=(None, None, None), dtype='uint8', max_shape=(1072, 1912, 3)), 'label': ClassLabel(shape=(), dtype='uint16', names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'], num_classes=42)})

    We see here that there are 42 classes in the dataset ds.

    Let’s visualize 6 random images from the dataset:

    import random
    import matplotlib.pyplot as plt

    def show_image_in_ds(ds, idx=1):
        image = ds['image', idx].compute()
        label = ds['label', idx].compute(label_name=True)
        print("Image:")
        plt.imshow(image)
        plt.show()
        print("Label: \"%s\"" % (label))

    num_images_to_display = 6
    for id in range(0, num_images_to_display):
        # randint is inclusive on both ends, so subtract 1 to stay in range
        show_image_in_ds(ds, random.randint(0, ds.shape[0] - 1))

[Image: six random images from the dataset with their labels]

As we can see here, the images have different sizes and need to be resized to a common size for training. For this, we can use the Hub transform feature:

    import hub
    from hub import schema
    from skimage.transform import resize
    from skimage import img_as_ubyte

    # resize images
    new_shape = (256, 256, 3)
    new_schema = {
        "image": schema.Image(shape=new_shape, dtype="uint8"),
        "label": schema.ClassLabel(names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'])
    }

    @hub.transform(schema=new_schema)
    def resize_transform(index):
        image = resize(ds['image', index].compute(), new_shape, anti_aliasing=True)
        image = img_as_ubyte(image)  # recast from float to uint8
        label = int(ds['label', index].compute())
        return {
            "image": image,
            "label": label
        }

    ds_r = resize_transform(range(ds.shape[0]))

    Now we want to store the resized dataset in Hub:

    url = "margauxmforsythe/simpsons_resized_256x256"
    # This will take some time as there are 20k images in the dataset
    ds_r.store(url)

The dataset is then available and can be visualized in Activeloop's visualization app:

[Screenshot: the resized dataset in Activeloop's visualization app]

or loaded using its URL:

    ds_from_hub = Dataset(url)

    # Visualize the images and labels
    def show_image_in_ds(ds, idx=1):
        image = ds['image', idx].compute()
        label = ds['label', idx].compute(label_name=True)
        print("Image:")
        print(image.shape)
        plt.imshow(image)
        plt.show()
        print("Label: \"%s\"" % (label))

    for i in range(6):
        show_image_in_ds(ds_from_hub, i)

[Image: the resized 256x256 images with their labels]

    Dataset Filtering / Variants of the same dataset

Using the filter feature of Hub, we can easily create subsets of the dataset or remove elements that are not needed for training.

Create a subset with only some selected characters

[Image: Maggie Simpson]

For example, if we want to create a subset of Maggie's images, we filter the dataset and only keep the items whose labels start with "maggie":

    # Creates a DatasetView object for a subset of the Dataset.
    ds_only_maggie = ds_from_hub.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie"))

So here, the filter applies the lambda to each item of the dataset: it returns True or False depending on whether the label starts with "maggie", and only the items for which it returns True are retained — that is to say, Maggie's images.
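
The same predicate can also be written as a named function, which makes the per-item semantics more explicit (a quick sketch; is_maggie is our own helper, not part of the Hub API):

    # is_maggie is our own helper, not part of the Hub API
    def is_maggie(sample):
        # compute(label_name=True) materializes the label as its class name
        return sample["label"].compute(label_name=True).startswith("maggie")

    ds_only_maggie = ds_from_hub.filter(is_maggie)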

    We can check if the number of images we now have in the subset is correct:

    from glob import glob

    number_maggie_images_in_subset = len(ds_only_maggie)
    path_to_maggie_images = './data/the-simpsons-characters-dataset/simpsons_dataset/simpsons_dataset/maggie_simpson'
    number_maggie_imgs = len(glob(f"{path_to_maggie_images}/*.jpg"))
    assert number_maggie_images_in_subset == number_maggie_imgs
    print(number_maggie_images_in_subset)

which returns: 128. So we know we have 128 images of Maggie in the subset ds_only_maggie.

    With the same logic, we can create a subset without Maggie:

    ds_without_maggie = ds.filter(lambda x: not x["label"].compute(label_name=True).startswith("maggie"))
    print(ds.shape[0] - number_maggie_images_in_subset == len(ds_without_maggie))  # shape is (20805,)

    which returns True, so we know that all 128 images of Maggie were removed.

    A Simpsons’ Family Photo (Dataset)

    Now we want to create a subset of the Simpsons family only: Maggie, Marge, Lisa, Bart, and Homer:

    # Creates a DatasetView object for a subset of the Dataset.
    ds_simpsons_family = ds_from_hub.filter(
        lambda x: x["label"].compute(label_name=True).startswith("maggie")
        or x["label"].compute(label_name=True).startswith("marge")
        or x["label"].compute(label_name=True).startswith("lisa")
        or x["label"].compute(label_name=True).startswith("bart")
        or x["label"].compute(label_name=True).startswith("homer"))

    print(len(ds_simpsons_family))  # returns 6361

    There are 6361 images of the members of the Simpsons family.

    Monitor your datasets without “D’oh!”-s

    “Mom, look, I found something more fun than complaining!”
    — Lisa Simpson

Datasets are, as we said before, the most important part of training. So why not treat them the way we treat scripts? When a training script is modified, we often want to know what changes were made so that, if something breaks, we can roll back to the previous version of the script — and this is usually done using git.

    So, why not do the exact same thing with the datasets? They are even more important than the training script!

    Well, that’s what Hub version control is doing. Here is an example with the different versions of the dataset (subsets) we created previously:

Create a first commit, "first commit", on the master branch:

    ds = Dataset(url)
    ds.checkout("master")
    a = ds.commit("first commit")

    Create a new branch called “subsets”:

    ds.checkout("subsets", create=True)  # creates a new branch
    ds.flush()
    print(ds.branches)  # returns dict_keys(['master', 'subsets'])
    ds.log()

    The ds.log() returns:

    Current Branch: subsets

    commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:22:46
    Message: "first commit"

This shows that we are on the branch "subsets" and that there is one commit, "first commit", on the master branch.

    Create a commit with only Maggie’s images in the “subsets” branch:

    ds.checkout("subsets")  # checkout to the subsets branch

    # Filter the dataset and only keep Maggie's images
    dt = ds.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie"))
    dt.commit("Maggie images subset")
    ds.log()

Now the log shows that we are still on the branch "subsets", and that another commit, "Maggie images subset", has been added to it:

    Current Branch: subsets

    commit 1b54aa2185d3f61167737a860f7205e15aeef7b6 (subsets)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:25:04
    Message: "Maggie images subset"

    commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:22:46
    Message: "first commit"

Commit the Simpsons family subset:

    # Filters the Simpsons family from the dataset
    dt = ds.filter(lambda x: x["label"].compute(label_name=True).startswith("maggie")
        or x["label"].compute(label_name=True).startswith("marge")
        or x["label"].compute(label_name=True).startswith("lisa")
        or x["label"].compute(label_name=True).startswith("bart")
        or x["label"].compute(label_name=True).startswith("homer"))
    c = dt.commit("Simpsons family subset")
    ds.log()

    And now the log shows the three commits:

    Current Branch: subsets

    commit 3cf078659a6499f9e6e8bf163cc6926ab2ab3d37 (subsets)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:34:31
    Message: "Simpsons family subset"

    commit 1b54aa2185d3f61167737a860f7205e15aeef7b6 (subsets)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:25:04
    Message: "Maggie images subset"

    commit 7d8d6c7f891139dba5c13ea57360b854ac6990d6 (master)
    Author: margauxmforsythe
    Commit Time:  2021-05-20 20:22:46
    Message: "first commit"

And finally, we want to go back to the first commit on the master branch:

    ds.checkout(a)  # reminder: we ran a = ds.commit("first commit")

    which could also be done with this line using the commit id shown in the log:

    ds.checkout('7d8d6c7f891139dba5c13ea57360b854ac6990d6')  # from the log

So now, we have two branches and three commits for the dataset corresponding to the URL "margauxmforsythe/simpsons_resized_256x256".

    Saving the Simpsons family subset as a separate Dataset

    Now if we want to use the subset with only the images of the Simpsons family, we can save the subset we created previously and use it for training — but keep the information that there are 42 classes in the original dataset so that we can train with more characters later:

    ds_S = ds_simpsons_family.store('margauxmforsythe/simpsons_family')
    ds_S

    which returns:

    Dataset(schema=SchemaDict({'image': Image(shape=(256, 256, 3), dtype='uint8'), 'label': ClassLabel(shape=(), dtype='uint8', names=['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson', 'carl_carlson', 'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler', 'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil', 'groundskeeper_willie', 'homer_simpson', 'kent_brockman', 'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson', 'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby', 'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders', 'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner', 'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum', 'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird', 'troy_mcclure', 'waylon_smithers'], num_classes=42)}), url='margauxmforsythe/simpsons_family', shape=(6361,), mode='w')

    And if we check in the web app, we see there are 6361 images.

[Screenshot: the simpsons_family dataset in the web app, showing 6361 images]

The notebook for the dataset manipulation with Hub features is here.

Now you can manipulate datasets more easily than ever before and start a simple training. Let's try it!

For the training, we will only use the Simpsons family subset and a simple CNN. The first step is to get the dataset ready for training — we will use TensorFlow, so we use the Hub feature to_tensorflow:

    def to_model_fit(item):
        x = item["image"] / 255  # normalize
        y = item["label"]
        return (x, y)

    image_count = len(ds_S)
    print(f"Images count: {image_count}")  # Images count: 6361

    ds_tf = ds_S.to_tensorflow(include_shapes=True)
    ds_tf = ds_tf.map(lambda x: to_model_fit(x))

Then we need to shuffle the dataset and split it into a training set and a validation set, with 80% of the images used for training and 20% for validation:

    train_size = int(0.8 * image_count)
    val_size = int(0.2 * image_count)
    batch_size = 8
    print(f"{train_size} training images and {val_size} validation images. Batch size of {batch_size}")
    # Shuffle once before splitting; reshuffle_each_iteration=False keeps the
    # train/validation split consistent across epochs
    list_ds = ds_tf.shuffle(image_count, reshuffle_each_iteration=False)
    val_ds = list_ds.take(val_size)
    train_ds = list_ds.skip(val_size)
    train_ds = train_ds.shuffle(train_size)
    train_ds = train_ds.batch(batch_size)
    val_ds = val_ds.shuffle(val_size)
    val_ds = val_ds.batch(batch_size)

which prints:

    => 5088 training images and 1272 validation images. Batch size of 8

    Now we can define the model, compile it and run the training:

    model = Simple_CNN_With_Dropout(num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=60)
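
The definitions of Simple_CNN_With_Dropout and num_classes are not shown in this post; as a reference point, here is a minimal sketch of what such a model could look like, assuming num_classes = 42 (matching the original schema) and 256x256x3 inputs — the actual architecture used may differ:

    import tensorflow as tf

    num_classes = 42  # assumption: matches the 42 classes of the original schema

    def Simple_CNN_With_Dropout(num_classes):
        # Minimal illustrative CNN; the post's actual architecture is not shown
        return tf.keras.Sequential([
            tf.keras.Input(shape=(256, 256, 3)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.5),
            # no softmax here: the loss above uses from_logits=True
            tf.keras.layers.Dense(num_classes),
        ])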

    Evaluation on the test set

We first need to set up the test set the same way we set up the previous dataset, using the hub.auto and hub.transform features. Note that we had to put the test images into subfolders corresponding to their classes beforehand:

[Image: test set directory structure, with one subfolder per class]

    test_set_path = "./data/the-simpsons-characters-dataset/kaggle_simpson_testset/test"
    ds_test = Dataset.from_path(test_set_path)

    # resize images
    new_shape = (256, 256, 3)
    new_schema = {
        "image": schema.Image(shape=new_shape, dtype="uint8"),
        "label": schema.ClassLabel(names=['bart_simpsons', 'homer_simpsons', 'lisa_simpsons', 'maggie_simpson', 'marge_simpson'])
    }

    @hub.transform(schema=new_schema)
    def resize_transform(index):
        image = resize(ds_test['image', index].compute(), new_shape, anti_aliasing=True)
        image = img_as_ubyte(image)  # recast from float to uint8
        label = int(ds_test['label', index].compute())
        return {
            "image": image,
            "label": label
        }

    ds_r = resize_transform(range(ds_test.shape[0]))
    ds_test = ds_r.store("margauxmforsythe/simpsons_dataset_test")

Finally, we run the model on the test set:

    import numpy as np

    ds_test = Dataset("margauxmforsythe/simpsons_dataset_test")
    ds_test_pred = ds_test.to_tensorflow(include_shapes=True).batch(1)
    ds_tf = ds_test_pred.map(lambda x: to_model_fit(x))
    predictions_test_ds = model.predict(ds_tf)
    y_pred = []
    y_true = []
    i = 0

    for img, label in ds_tf:
        y_true.append(classes_family[label.numpy()[0]])
        y_pred.append(classes[np.argmax(predictions_test_ds[i])])
        plt.imshow(img[0])
        plt.show()
        print(f"Predicted class: {classes[np.argmax(predictions_test_ds[i])]}, real class: {classes_family[label.numpy()[0]]}")
        i = i + 1
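
One caveat: classes and classes_family are not defined in the snippets above. Presumably classes is the 42-name list from the training schema (used to decode the model's 42-way output) and classes_family is the 5-name list from the test schema (used to decode the true test labels). A sketch under those assumptions:

    # Assumption: the 42 ClassLabel names from the training schema, in order,
    # used to decode the model's 42-way output
    classes = ['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon',
               'barney_gumble', 'bart_simpson', 'carl_carlson',
               'charles_montgomery_burns', 'chief_wiggum', 'cletus_spuckler',
               'comic_book_guy', 'disco_stu', 'edna_krabappel', 'fat_tony', 'gil',
               'groundskeeper_willie', 'homer_simpson', 'kent_brockman',
               'krusty_the_clown', 'lenny_leonard', 'lionel_hutz', 'lisa_simpson',
               'maggie_simpson', 'marge_simpson', 'martin_prince', 'mayor_quimby',
               'milhouse_van_houten', 'miss_hoover', 'moe_szyslak', 'ned_flanders',
               'nelson_muntz', 'otto_mann', 'patty_bouvier', 'principal_skinner',
               'professor_john_frink', 'rainier_wolfcastle', 'ralph_wiggum',
               'selma_bouvier', 'sideshow_bob', 'sideshow_mel', 'snake_jailbird',
               'troy_mcclure', 'waylon_smithers']
    # Assumption: the 5 ClassLabel names of the test schema, in order,
    # used to decode the true test labels
    classes_family = ['bart_simpsons', 'homer_simpsons', 'lisa_simpsons',
                      'maggie_simpson', 'marge_simpson']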

    These are some of the results from the predictions on the test set (there was no example of Maggie in the test set):

[Image: sample predictions on the test set, showing predicted and real classes]

    The final confusion matrix after 60 epochs:

[Image: confusion matrix after 60 epochs]
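
The post doesn't show the code that produced this matrix; here is a minimal sketch using scikit-learn on the y_true/y_pred lists collected above (an assumption — the original plotting code may differ):

    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    # Build the confusion matrix from the collected string labels
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    # Plot it
    plt.imshow(cm, cmap="Blues")
    plt.xticks(range(len(labels)), labels, rotation=90)
    plt.yticks(range(len(labels)), labels)
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.show()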

The training notebook using the Hub datasets is here.

"Oh, so they have internet on computers now!"
— Homer Simpson

    If you have any questions regarding this tutorial, I’ll be at Moe’s… ehm, the Deep Lake Slack Community. Feel free to hit us up there - we might even have donuts!
