How to create collaborative Machine Learning datasets for projects gathering 50+ collaborators

    How can one create a collaborative environment to foster innovation in Machine Learning teams? Learn how Margaux used Activeloop Hub with 50+ collaborators to increase food security in Senegal.
    • Margaux Masson-Forsythe

      on Jun 16, 2021 · 6 min read


    The common keys to successful projects are efficient collaborative work and good communication, led by a diverse team. At Omdena, every project gathers 50+ collaborators from all around the world who work together to develop innovative, ethical, and useful AI solutions in two months. Each project tackles issues like climate change, fake news, food insecurity, online threats, disease spread, bank security, and more.

    Collaborators therefore need Machine Learning datasets that are easily accessible to everyone, so that the progress of a project is not delayed by dataset-related issues.

    In this article, I will show how we used collaborative Machine Learning datasets for the Omdena GPSDD project.

    Photo by Ant Rozetsky on Unsplash

    Project

    The Omdena project Improving Food Security and Crop Yield in Senegal was a collaboration with the Global Partnership for Sustainable Development Data (GPSDD). The goal was to use machine learning to increase food security in Senegal. With this goal in mind, the project had several objectives tackling various areas connected to food insecurity, such as crop yield, climate risk, crop diseases, deforestation, and food storage/transport.

    Problem Statement

    We needed a way to handle the datasets for the crop yield prediction subtask, which consisted of analyzing satellite data and field data to estimate yields at different levels.

    Summary of project’s structure — Source: Omdena

    So, we had several issues:

    • Raw satellite images are too large to be easily stored and made accessible to all collaborators
    • The Deep Learning (DL) models developed use preprocessed satellite images
    • The preprocessed data depend on the crop type and so need to be carefully prepared
    • We had one dataset per studied country and crop type (Maize, Rice, Millet), which was a deliberate choice
    • The models are trained per crop type, and each takes an input size specific to that crop
    • The training datasets (preprocessed data + ground truth) could not easily be stored in or accessed from AWS or GitHub, especially since we trained the models on Google Colab

    We solved most of these issues by using Activeloop.

    Activeloop is a fast and simple framework for building and scaling data pipelines for machine learning.

    Datasets description

    For the GPSDD Senegal project, we used Activeloop to store the datasets used to train our Deep Learning models. These datasets contained the 32-bin histograms of satellite images, the normalized difference vegetation index (NDVI) values, and the yield (ground truth) values, all originally saved locally as .npy files. The resulting Activeloop Dataset schema was:

    from hub import Dataset, schema

    # One sample per histogram, with three tensor fields per sample.
    ds = Dataset(
        tag,
        shape=(histograms.shape[0],),
        schema={
            "histograms": schema.Tensor(histograms[0].shape, dtype="float"),
            "ndvi": schema.Tensor(ndvi[0].shape, dtype="float"),
            "yields": schema.Tensor(shape=(1,), dtype="float"),
        },
        mode="w+",
    )

    Ground Truth

    To add the ground truth yield values to the Activeloop dataset, we had to save them as a list of lists of the values, as follows:

    yields_list = [[yield_1], [yield_2], [yield_3], …, [yield_n]]
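
    In practice, this is just a reshape. Here is a minimal sketch (the yield values are made up for illustration):

    import numpy as np

    # Hypothetical flat array of per-region yield values.
    yields = np.array([2.1, 1.8, 3.4, 2.7])

    # Reshape to (n, 1) so each entry matches the "yields" Tensor of shape (1,).
    yields_list = yields.reshape(-1, 1).tolist()
    # -> [[2.1], [1.8], [3.4], [2.7]]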

    Use case: storing/combining several datasets for each country

    We had data from several countries that we wanted to keep in separate datasets, because we sometimes used them individually and sometimes combined them.

    For example, to perform transfer learning with the crop yield prediction model, we first trained the model on the datasets from South Sudan and Ethiopia, then fine-tuned the resulting pre-trained model on the combined datasets from South Sudan, Ethiopia, and Senegal (a sketch of this two-stage training follows).
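
    At a high level, the two stages might look like the outline below. This is only a sketch under assumptions: build_model, pretrain_ds, and combined_ds are hypothetical names standing in for the CNN and the tf.data pipelines built later in this post, and the learning rates are illustrative.

    import tensorflow as tf

    # Hypothetical helper standing in for the CNN used in this project.
    model = build_model()

    # Stage 1: pre-train on the South Sudan + Ethiopia data.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=tf.keras.losses.MeanSquaredError())
    model.fit(pretrain_ds, epochs=10)

    # Stage 2: fine-tune the same weights on all three countries combined,
    # with a smaller learning rate to avoid overwriting what was learned.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.MeanSquaredError())
    model.fit(combined_ds, epochs=5)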

    Using Activeloop for this made things easier and cleaner. Each dataset was loaded from the Activeloop hub using its unique path, and we could then combine the datasets easily.

    For example:

    import numpy as np
    from hub import Dataset

    tag1 = "username/SouthSudan_dataset"
    tag2 = "username/Ethiopia_dataset"
    tag3 = "username/Senegal_dataset"

    # Load each country's dataset from the Activeloop hub by its unique path.
    ds1 = Dataset(tag1)
    ds2 = Dataset(tag2)
    ds3 = Dataset(tag3)

    print(f"Dataset {tag1} shape: {ds1['histograms'].compute().shape}")
    print(f"Dataset {tag2} shape: {ds2['histograms'].compute().shape}")
    print(f"Dataset {tag3} shape: {ds3['histograms'].compute().shape}")

    # Concatenate the three datasets along the sample axis.
    histograms = np.concatenate(
        (
            ds1["histograms"].compute(),
            ds2["histograms"].compute(),
            ds3["histograms"].compute(),
        ),
        axis=0)

    yields_list = np.concatenate(
        (
            ds1["yields"].compute(),
            ds2["yields"].compute(),
            ds3["yields"].compute(),
        ),
        axis=0)

    print(f"Datasets combined, histograms set's shape is {histograms.shape}")
    print(f"Data loaded from {tag1}, {tag2} and {tag3}")

    Training

    When we combined the three datasets with np.concatenate, we used the tf.data.Dataset.from_tensor_slices function to convert the arrays into a TensorFlow dataset:

    import tensorflow as tf

    # Build a tf.data pipeline from the combined numpy arrays.
    list_ds = tf.data.Dataset.from_tensor_slices((histograms, yields_list))
    image_count = histograms.shape[0]

    But when we worked with only one of the datasets, we directly used Activeloop's "convert to TensorFlow" feature, ds.to_tensorflow():

    def to_model_fit(item):
        # Map each hub sample dict to the (features, label) pair Keras expects.
        x = item["histograms"]
        y = item["yields"]
        return (x, y)

    list_ds = ds1.to_tensorflow()
    list_ds = list_ds.map(lambda x: to_model_fit(x))
    image_count = ds1["histograms"].compute().shape[0]

    Here is a nice example of how to use Activeloop datasets directly to train with TensorFlow.

    Then we split the data into train, validation, and test sets using the skip and take functions. Once we had the three sets, we batched, shuffled, and cached them using the TensorFlow functions.

    batch_size = 16

    print("Total files: {}".format(image_count))
    train_size = int(0.8 * image_count)
    val_size = int(0.1 * image_count)
    test_size = int(0.1 * image_count)

    # Shuffle once (without reshuffling each iteration, so the split stays
    # fixed across epochs), then carve out test / val / train with take and skip.
    list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)
    test_ds = list_ds.take(test_size)
    train_ds = list_ds.skip(test_size)
    val_ds = train_ds.take(val_size)
    train_ds = train_ds.skip(val_size)

    train_ds = train_ds.shuffle(train_size)
    train_ds = train_ds.batch(batch_size)

    # Validation and test sets are small enough to evaluate as single batches.
    val_ds = val_ds.shuffle(val_size)
    val_ds = val_ds.batch(val_size)

    test_ds = test_ds.batch(test_size)

    And finally, we trained our CNN model with these standard Keras calls:

    # Track RMSE and MSE alongside the default metrics; MSE is also the loss.
    metrics_list = [
        'accuracy',
        tf.keras.metrics.RootMeanSquaredError(name='RMSE'),
        tf.keras.losses.MeanSquaredError(name='MSE')
        ]

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=metrics_list
        )

    model.fit(train_ds,
              epochs=1,
              validation_data=val_ds,
              verbose=1)
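
    For completeness, the snippet above assumes a model object already exists; the post does not show the architecture itself. A minimal illustrative CNN over (timesteps, bins)-shaped histogram inputs might look like the following. The input shape and layer sizes are assumptions, not the project's actual model:

    import tensorflow as tf

    # Illustrative only: a small 1-D CNN mapping a sequence of 32-bin
    # histograms to a single yield value. Shapes are hypothetical.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32)),      # e.g. 32 timesteps x 32 bins
        tf.keras.layers.Conv1D(64, 3, activation='relu'),
        tf.keras.layers.Conv1D(64, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)                    # scalar yield prediction
    ])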
    
    

    Consistency of path and data

    Another advantage of using Activeloop in this project was that the dataset paths were accessible to all developers without anyone having to store the data locally, and we could be certain that everyone was using the same dataset.

    Update a dataset

    It was also easy to replace a dataset by re-uploading the updated version to the same tag. This was really useful when we collected and processed more data and had to update the datasets used for training. All trainings were done in Google Colab notebooks, all using Activeloop-stored datasets. The import step therefore consisted only of loading the Dataset with the code above, which loads all the data at once rather than one file at a time (which might otherwise require a dataloader class or function).
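
    As a rough sketch, updating boiled down to re-running the dataset creation at the same tag. The slice-assignment writes and the flush() call below follow my reading of the Hub 1.x API, so treat this as an outline rather than exact code:

    from hub import Dataset, schema

    # Re-create the dataset at the same tag in write mode; collaborators who
    # load this path afterwards get the updated data automatically.
    ds = Dataset(
        "username/Senegal_dataset",
        shape=(histograms.shape[0],),
        schema={
            "histograms": schema.Tensor(histograms[0].shape, dtype="float"),
            "ndvi": schema.Tensor(ndvi[0].shape, dtype="float"),
            "yields": schema.Tensor(shape=(1,), dtype="float"),
        },
        mode="w+",
    )

    # Write the updated arrays and persist the new version to the hub.
    ds["histograms"][:] = histograms
    ds["ndvi"][:] = ndvi
    ds["yields"][:] = yields_list
    ds.flush()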

    Discussion

    We could have used Activeloop to store the satellite images as well, but we decided instead to store them in an S3 bucket, so that the raw data lived in the bucket and the pre-processed, ready-to-use datasets in Activeloop. All the pre-processing that led from the satellite images to the histograms was therefore done on a local machine, with the images downloaded locally.
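
    As an illustration of that preprocessing step, collapsing a raw band into a fixed-length histogram can be as simple as the sketch below. The value range and normalization are assumptions; the project's exact binning is not described here:

    import numpy as np

    band = np.random.rand(512, 512)  # stand-in for one downloaded satellite band

    # 32-bin histogram of pixel values, normalized so images of any size compare.
    hist, _ = np.histogram(band, bins=32, range=(0.0, 1.0))
    hist = hist / hist.sum()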

    The way we stored the ground truth yield values could probably be improved by using the available schemas more efficiently.

    To conclude, in this project we used Activeloop to efficiently and easily store our ML datasets, with a unique and consistent path for all collaborators.
