
    Accelerate your Machine Learning Workflow

    ML projects sometimes get stuck due to... fishy reasons. Margaux compares Activeloop Cloud storage against AWS S3 using a famous aquatic dataset. Uploading datasets to Hub using parallel computing was 2x faster than AWS CLI and ~20x faster than boto3!
    • Margaux Masson-Forsythe

      on Sep 13, 2021 · 11 min read


    Introduction

    The workflow of a Machine Learning (ML) project includes all the stages required to put a model into production that performs a task such as image classification. The figure below gives a high-level overview of these stages:

    Machine Learning workflow — Image by author (icons from [https://www.flaticon.com/](https://www.flaticon.com/))

    If you are familiar with the Machine Learning workflow, you probably know that the most time-consuming step is usually Data Preparation. There are several reasons why this step takes up a lot of time:

    • Working with real-world data is complex and often requires a lot of cleaning

    • Data-preprocessing algorithms are often slow, and this gets worse as the amount of data increases or when the pre-processing steps have to be re-run for some reason

    • Choosing the optimal data storage solution is not easy and can be quite expensive

    • Sharing data with the rest of the team is harder than sharing code

    In this article, we will analyze the time that it takes to perform a simple task from the data preparation stage: uploading a dataset to the Cloud so that it can be shared with others. For this, we will compare how long it takes to send a computer vision dataset to an Amazon Web Services (AWS) s3 bucket versus sending it to Hub.

    Benchmark study: Uploading a dataset to the Cloud

    Most ML teams use Cloud storage to store and access data. Especially when working on computer vision projects, the datasets can get really heavy and are hard to store locally. Some companies use a NAS (Network Attached Storage) to store their data, but in the past couple of years a lot of them have started to switch to Cloud storage instead. It is indeed much easier not to have to deal with configuring and maintaining an ever-growing NAS.

    So, which is faster: s3 or Hub?

    For this comparison, we chose to use the Kaggle dataset: A Large-Scale Dataset for Fish Segmentation and Classification.

    1. Download the Dataset

    First, let’s download the dataset from Kaggle using this command:

    export KAGGLE_USERNAME="xxxxxx" && export KAGGLE_KEY="xxxxxx" && kaggle datasets download -d crowww/a-large-scale-fish-dataset
    

    This downloads the archived dataset into the workspace; it now needs to be unzipped:

    unzip -n a-large-scale-fish-dataset.zip
    

    The dataset is organized into subfolders, each corresponding to a class (fish species):

    Dataset’s architecture — Image by author
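
    For readers without the figure, the folder layout looks roughly like this (reconstructed from the example paths shown later in this post, so treat the exact tree as illustrative):

    Fish_Dataset/Fish_Dataset/
      Striped Red Mullet/
        Striped Red Mullet/       # RGB images (*.png)
        Striped Red Mullet GT/    # corresponding ground-truth masks (*.png)
      Black Sea Sprat/
        Black Sea Sprat/
        Black Sea Sprat GT/
      ...                         # and so on, one folder per fish species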

    This dataset is quite large: in total, we have 18000 png images. Each fish species subfolder has two subfolders: one that contains the RGB images of the fish and one that contains the corresponding segmentation masks (also called GT, for Ground Truth).

    We gather all the paths to the images in a list called files_in_fish_set_images and all the paths to the masks in a list called files_in_fish_set_GT:

    import os
    import fnmatch
    
    files_in_fish_set_images = []
    files_in_fish_set_GT = []
    for dirpath, dirs, files in os.walk(dataset_path): 
      for filename in fnmatch.filter(files, '*.png'):
        fname = os.path.join(dirpath,filename)
        if 'GT' in fname:
          files_in_fish_set_GT.append(fname)
        else:
          files_in_fish_set_images.append(fname)
    
    print(f'There are {len(files_in_fish_set_images)} pictures of fish and {len(files_in_fish_set_GT)} corresponding masks (GT)')
    

    ➡️ There are 9000 pictures of fish and 9000 corresponding masks (GT)

    So we have 18000 images in total: half of them are pictures of fish and the other half are the corresponding segmentation masks.

    We visualized the first image/mask pair in the dataset using the Image module from Pillow:

    from PIL import Image
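    
    # The display call itself is not shown in the original snippet; the two lines
    # below are an illustrative sketch of how the first pair can be viewed with Pillow.
    Image.open(files_in_fish_set_images[0]).show()  # RGB picture of the fish
    Image.open(files_in_fish_set_GT[0]).show()      # corresponding ground-truth (GT) mask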
    

    First image and mask in the dataset — Image by author

    2. Send dataset to s3 bucket: AWS CLI

    We are sending the dataset to the s3 bucket first. We will use AWS Command Line Interface (AWS CLI) to perform this task, so let’s install this package using pip:

    # Install AWS CLI 
    !pip install --upgrade awscli
    

    Then you need to set up your AWS credentials:

    # Configure AWS credentials
    !aws configure
    

    We then send the Fish dataset to the s3 bucket using this command:

    # Send dataset to s3 using aws CLI
    # We created a bucket "fish-classification-segmentation-dataset" in s3 ahead of time
    !aws s3 cp --recursive --no-progress --only-show-errors "./Fish_Dataset" "s3://fish-classification-segmentation-dataset"
    

    We also want to measure how long this command takes, so we set up a timer that starts right before running this line and stops at the end of the execution of this task.
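
    Here is a minimal sketch of what that timing can look like; the exact notebook cell is not reproduced in the post, and shelling out with subprocess (rather than a ! cell magic) is our assumption:

    import subprocess
    from timeit import default_timer as timer
    
    # Illustrative only: run the same aws s3 cp command and time it
    start = timer()
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", "--no-progress", "--only-show-errors",
         "./Fish_Dataset", "s3://fish-classification-segmentation-dataset"],
        check=True)
    end = timer()
    print(f"Sending the dataset to s3 using AWS CLI took {end - start:.1f} seconds")
    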

    ➡️ Sending the dataset to s3 using AWS CLI took ~370 seconds.

    Then, if we check our s3 bucket, we can see that all the subfolders were uploaded:

    s3 Fish dataset bucket — Image by author

    3. Send dataset to s3 bucket: boto3

    Now we will test boto3, a commonly used Python package for communicating with AWS s3.

    We need to import the following packages:

    import boto3
    from botocore.exceptions import NoCredentialsError
    from timeit import default_timer as timer
    from tqdm import tqdm
    

    Then we set up our AWS credentials variables:

    ACCESS_KEY = 'xxxxxx'
    SECRET_KEY = 'xxxxxx'
    AWS_SESSION_TOKEN='xxxxxx'
    

    Now we can use these credentials to connect to s3 using boto3.client:

    # Set up connection to s3 account with credentials provided by user
    s3_client = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                              aws_secret_access_key=SECRET_KEY, 
                              aws_session_token=AWS_SESSION_TOKEN)
    

    We then implement the function upload_to_aws, which sends a single file to an s3 bucket. It takes as inputs the s3 client we just configured, the path to the local file to send to s3, the name of the s3 bucket the file should be sent to, and the name the file should have in the s3 bucket:

    def upload_to_aws(s3_client, local_file, bucket, s3_file):
        # Try to upload file to s3 bucket and handle errors that can happen.
        # Returns True if success, False otherwise
        try:
            s3_client.upload_file(local_file, bucket, s3_file)
            print("Upload Successful")
            return True
        except FileNotFoundError:
            print("The file was not found")
            return False
        except NoCredentialsError:
            print("Credentials not available")
            return False
    

    As mentioned above, this function sends only one file to the s3 bucket, so we need to implement a loop that goes through all of the images in the dataset and sends them to s3. We will also add a timer to log how much time this whole process takes, so that we can compare it to the other approaches:

    # We created a bucket in s3 called fish-classification-segmentation-dataset-boto3
    bucket_name = "fish-classification-segmentation-dataset-boto3"
    
    # All 18000 files (RGB images + GT masks) to upload
    files_in_fish_set = files_in_fish_set_images + files_in_fish_set_GT
    
    # Setting up timer to perform benchmarking
    start = timer()
    
    for img_path in tqdm(files_in_fish_set):
      # We only want the file's name after "./Fish_Dataset/Fish_Dataset/"
      file_name_in_s3 = img_path.split(dataset_path)[1]
      # Upload file to the bucket 
      uploaded = upload_to_aws(s3_client, img_path, bucket_name, file_name_in_s3)
    
    # End timer
    end = timer()
    total_time_s3 = end-start
    
    print(f"Uploading all the files in the Fish dataset to s3 took {total_time_s3} seconds")
    

    And this is the result:

    Send dataset to s3 and log time — Image by author

    ➡️ Uploading all the files in the Fish dataset with boto3 to s3 took 3768.327020835 seconds.

    The script took approximately 63 minutes to execute. We can see here that using the AWS CLI is much faster than boto3, which makes sense: the CLI runs multiple transfers concurrently, whereas our boto3 loop uploads the files one at a time.

    Now, we can try to send the zipped dataset to s3 using boto3:

    # Upload the zipped dataset
    # Setting up timer to perform benchmarking
    start = timer()
    
    path_to_zipped_dataset = 'a-large-scale-fish-dataset.zip'
    uploaded = upload_to_aws(s3_client, path_to_zipped_dataset, bucket_name, path_to_zipped_dataset)
    
    # End timer
    end = timer()
    total_time_s3 = end-start
    
    print(f"Uploading the zipped Fish dataset to s3 took {total_time_s3} seconds")
    

    Result:

    Time taken by boto3 to send zipped dataset — Image by author

    ➡️ Sending the zipped dataset to s3 using boto3 took 40.643983771999956 seconds.

    This is a good way to send the zipped dataset to s3; however, the user will have to download and unzip the archive each time they want to use or visualize the data from the s3 bucket.
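
    To make that extra step concrete, here is a minimal sketch of what the consumer side would look like, reusing the s3_client and bucket_name from above (this download/extract code is not part of the original post):

    import zipfile
    
    # Illustrative only: every consumer of the zipped copy has to download
    # and extract the archive before the images can be used or visualized.
    zip_name = 'a-large-scale-fish-dataset.zip'
    s3_client.download_file(bucket_name, zip_name, zip_name)
    
    with zipfile.ZipFile(zip_name) as archive:
        archive.extractall('./Fish_Dataset_from_s3')
    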

    4. Send dataset to Hub

    Now let’s send the dataset to Hub. First, here is how you can install Hub with pip:

    pip install hub==2.0.7
    

    NB: if using Google Colab, restart the runtime after running this line.

    Then, we import the modules we will need:

    import hub
    import numpy as np
    

    We also need to login to Activeloop:

    !activeloop login -u username -p password
    

    We are using the Hub storage, so we define a Hub path to the dataset that we are about to create and populate:

    hub_fish_path = "hub://margauxmforsythe/fish-dataset"
    

    We need to know the names of all the classes in the dataset so that they can be configured as labels in the Hub dataset:

    # Find the class_names 
    # we do not want the txt files to be included so we only look for the subfolders' names
    class_names = [name for name in os.listdir(dataset_path) if os.path.isdir(dataset_path + name)]
    
    print(f"There are {len(class_names)} classes: {class_names}")
    

    ➡️ There are 9 classes: [‘Striped Red Mullet’, ‘Black Sea Sprat’, ‘Trout’, ‘Hourse Mackerel’, ‘Shrimp’, ‘Sea Bass’, ‘Red Mullet’, ‘Gilt-Head Bream’, ‘Red Sea Bream’]

    So we have 9 different species of fish.

    We can now send the dataset to Hub (i.e. in Hub format to Activeloop Cloud). In this dataset, each item will have an image, a mask, and a label. Once again, we will use a timer to measure how long this takes so that we can compare it to the previous tests:

    # Setting up timer to perform benchmarking
    start = timer()
    
    # Uploading to Hub storage at the path: hub_fish_path
    with hub.empty(hub_fish_path, overwrite=True) as ds:
        # Create the tensors with names of your choice.
        ds.create_tensor('images', htype = 'image', sample_compression = 'png')
        ds.create_tensor('masks', htype = 'image', sample_compression = 'png')
        ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
        # Add arbitrary metadata - Optional
        ds.info.update(description = 'Fish classification & Segmentation dataset')
        ds.images.info.update(camera_type = 'SLR')
    
        # Iterate through the files and append to hub dataset
        for i in tqdm(range(len(files_in_fish_set_images))):
            file_image = files_in_fish_set_images[i]
            file_mask = files_in_fish_set_GT[i]
            label_text = os.path.basename(os.path.dirname(file_image))
            label_num = class_names.index(label_text)
    
            # Append to images tensor using hub.read
            ds.images.append(hub.read(file_image))  
            # Append to masks tensor using hub.read
            ds.masks.append(hub.read(file_mask))  
            # Append to labels tensor
            ds.labels.append(np.uint32(label_num)) 
    
    # End timer
    end = timer()
    total_time_hub = end-start
    
    print(f"Uploading all the files in the Fish dataset to Hub took {total_time_hub} seconds")
    

    Dataset sent to Hub and logging time — Image by author

    ➡️ Uploading all the files in the Fish dataset to Hub took 1047.8884124840006 seconds.

    This took quite a long time. However, with Hub, we can use parallel computing to upload a dataset faster. Let’s try it!

    First, we implement the function file_to_hub, which will run in parallel and converts the data from the files (image, mask, label) into Hub format:

    @hub.compute
    def file_to_hub(path_to_pair_img_mask, sample_out, class_names):
        file_image = path_to_pair_img_mask[0]
        file_mask = path_to_pair_img_mask[1]
    
        label_text = os.path.basename(os.path.dirname(file_image))
        label_num = class_names.index(label_text)
    
        # Append the label and image to the output sample
        sample_out.labels.append(np.uint32(label_num))
        sample_out.images.append(hub.read(file_image))
        sample_out.masks.append(hub.read(file_mask))
    
        return sample_out
    

    We defined file_to_hub such that it takes as input a list with two items: the path to the image and the path to the corresponding mask. Therefore, we need to create this list of pairs to be able to populate our dataset:

    # Creating a list that combines the paths to the image/mask pairs
    list_pairs_img_mask = [[files_in_fish_set_images[i], files_in_fish_set_GT[i]] for i in range(len(files_in_fish_set_images))]
    

    So now we have the list list_pairs_img_mask that contains the paths to all the image/mask pairs; for example, here is the first item of the list:

    list_pairs_img_mask[0]= [‘./Fish_Dataset/Fish_Dataset/Striped Red Mullet/Striped Red Mullet/00361.png’,
     ‘./Fish_Dataset/Fish_Dataset/Striped Red Mullet/Striped Red Mullet GT/00361.png’]
    

    We are ready to create our new dataset, which we will reference through the variable hub_fish_path_parallel_computing and which will be available at the path ‘hub://margauxmforsythe/fish_dataset_parallel_computing’, using parallel computing (for this example, we use num_workers = 10, which should significantly speed up the process):

    hub_fish_path_parallel_computing = 'hub://margauxmforsythe/fish_dataset_parallel_computing'
    
    # Setting up timer to perform benchmarking
    start = timer()
    
    with hub.empty(hub_fish_path_parallel_computing, overwrite=True) as ds:
        ds.create_tensor('images', htype = 'image', sample_compression = 'png')
        ds.create_tensor('masks', htype = 'image', sample_compression = 'png')
        ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
        file_to_hub(class_names=class_names).eval(list_pairs_img_mask, ds, num_workers = 10)
    
    # End timer
    end = timer()
    total_time_hub = end-start
    
    print(f"Uploading all the files in the Fish dataset to Hub with parallel computing took {total_time_hub} seconds")
    

    Time taken by Hub using parallel computing to upload Fish dataset — Image by author

    ➡️ It only took 182.68960309999966 seconds to upload the Fish dataset to Hub when using parallel computing!

    Let’s take a look at our dataset:

    import random
    
    ds_from_hub_parallel_computing = hub.dataset(hub_fish_path_parallel_computing)
    
    # Pick 8 random indices in the dataset
    list_random = [random.randrange(len(ds_from_hub_parallel_computing)) for i in range(8)]
    
    for i in list_random:
      print(f'\n{class_names[ds_from_hub_parallel_computing.labels[i].numpy()[0]]}')
      Image.fromarray(ds_from_hub_parallel_computing.images[i].numpy()).show()
    
      # grayscale mask
      Image.fromarray(ds_from_hub_parallel_computing.masks[i].numpy()*255).show()
    

    8 random images and their mask in hub_fish_path_parallel_computing — Image by author

    We now have easy access to the entire organized dataset, where each image is stored with its segmentation mask and label! The dataset is available at the public URL: ‘hub://margauxmforsythe/fish_dataset_parallel_computing’.

    Here are our final benchmark results for the time taken to upload the full unzipped Fish dataset to the Cloud:

    • AWS CLI: 369.0134469759996 seconds

    • boto3 — full unzipped dataset: 3768.327020835 seconds

    • Hub: 1047.8884124840006 seconds

    • Hub with parallel computing: 182.68960309999966 seconds

    Uploading the entire dataset to Hub using parallel computing was 2 times faster than AWS CLI and ~20 times faster than boto3!

    Source: [https://tenor.com/search/i-am-speed-gifs](https://tenor.com/search/i-am-speed-gifs)

    So now we have a clean, organized, and easy-to-access dataset in Hub, and it took only a few minutes to upload 🏁 We are ready to use this dataset to train a classification/segmentation model!
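
    As a pointer to that next step, here is a minimal sketch of reading samples back out of the Hub dataset with the same indexing API used above (the actual model and training loop are left out; Hub also ships a PyTorch integration, which we do not show here):

    ds = hub.dataset(hub_fish_path_parallel_computing)
    
    for i in range(len(ds)):
        image = ds.images[i].numpy()                  # RGB image array
        mask = ds.masks[i].numpy()                    # ground-truth segmentation mask
        label = class_names[ds.labels[i].numpy()[0]]  # fish species name
        # ... feed image / mask / label into your preprocessing or training code
    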

    And this is how you can speed up your Machine Learning Workflow at the very first step! ⏱

    The notebook for this tutorial is available on Google Colab.
