How to create collaborative Machine Learning datasets for projects gathering 50+ collaborators
    • Tutorials
    How can one create a collaborative environment to foster innovation in Machine Learning teams? Learn how Margaux used Activeloop Hub with 50+ collaborators to increase food security in Senegal.
    • Margaux Masson-Forsythe
    6 min read · Jun 16, 2021 · Updated Apr 20, 2022
    Efficient collaboration and good communication, led by a diverse team, are the common keys to successful projects. At Omdena, every project gathers 50+ collaborators from around the world who work together to develop innovative, ethical, and useful AI solutions in two months. Each project tackles issues like climate change, fake news, food insecurity, online threats, disease spread, bank security, and more.

    Collaborators therefore need Machine Learning datasets that are easily accessible to everyone, so that the project's progress is not delayed by dataset-related issues.

    In this article, I will show how we used collaborative Machine Learning datasets for the Omdena GPSDD project.

    Photo by Ant Rozetsky on Unsplash

    Project

    The Omdena project Improving Food Security and Crop Yield in Senegal was a collaboration with the Global Partnership for Sustainable Development Data (GPSDD). The goal was to use machine learning to increase food security in Senegal. With this goal in mind, the project had several objectives tackling various areas connected to food insecurity, such as crop yield, climate risk, crop diseases, deforestation, and food storage/transport.

    Problem Statement

    We needed a solution for handling the datasets for the crop yield prediction subtask, which consisted of analyzing satellite and field data to estimate yields at different levels.

    Summary of the project's structure — Source: Omdena

    So, we had several issues:

    • Raw satellite images are too large to be easily stored and made accessible to all collaborators
    • The Deep Learning (DL) models we developed use preprocessed satellite images
    • The preprocessed data depend on the crop type and therefore need to be carefully prepared
    • We had one dataset per studied country and crop type (Maize, Rice, Millet) — which was a deliberate choice
    • The models are trained per crop type and take inputs of a specific size depending on the crop
    • The training datasets (preprocessed data + ground truth) cannot easily be stored on or accessed from AWS or GitHub — especially since we trained the models on Google Colab

    We solved most of these issues by using Activeloop.

    Activeloop is a fast and simple framework for building and scaling data pipelines for machine learning.
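
    For context, here is a minimal setup sketch, assuming the hub 1.x API used throughout this article (the library was published on PyPI as hub; the tag below is illustrative):

    # Shell: pip install hub
    from hub import Dataset

    # Datasets are addressed by a tag of the form "<username>/<dataset_name>"
    # and load lazily: data is only materialized when .compute() is called.
    ds = Dataset("username/Senegal_dataset")
    print(ds["histograms"].compute().shape)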

    Datasets description

    For the GPSDD Senegal project, we used Activeloop to store the datasets used for training our Deep Learning models. These datasets contained the 32-bin histograms of satellite images, the normalized difference vegetation index (NDVI) values, and the yield (ground truth) values, originally saved locally as .npy files. The resulting Activeloop Dataset's schema was:

    from hub import Dataset, schema

    ds = Dataset(
        tag,
        shape=(histograms.shape[0],),
        schema={
            "histograms": schema.Tensor(histograms[0].shape, dtype="float"),
            "ndvi": schema.Tensor(ndvi[0].shape, dtype="float"),
            "yields": schema.Tensor(shape=(1,), dtype="float"),
        },
        mode="w+",
    )

    Ground Truth

    To add the ground truth yield values to the Activeloop dataset, we had to save them as a list of lists of the values, as follows:

    yields_list = [[yield_1], [yield_2], [yield_3], …, [yield_n]]
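
    For illustration, here is a minimal sketch of how the schema above could then be filled and persisted; the per-sample assignment and the ds.flush() call are our reconstruction of the hub 1.x API, not the project's original code:

    import numpy as np

    # Hypothetical reconstruction: write each sample into the dataset's tensors.
    for i in range(histograms.shape[0]):
        ds["histograms", i] = histograms[i]
        ds["ndvi", i] = ndvi[i]
        ds["yields", i] = np.array(yields_list[i])

    # Persist under the tag so all collaborators can load the same data.
    ds.flush()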

    Use case: storing/combining several datasets for each country

    We had data from several countries that we wanted to keep in separate datasets, because we sometimes used them individually and sometimes combined them.

    For example, to perform transfer learning with the crop yield prediction model, we first trained the model on the datasets from South Sudan and Ethiopia, then fine-tuned the resulting pre-trained model on the combined datasets from South Sudan, Ethiopia, and Senegal.

    Using Activeloop for this purpose made the process easier and cleaner: each dataset was loaded from the Activeloop hub via its unique path, and we could then combine the datasets easily.

    For example:

    import numpy as np
    from hub import Dataset

    tag1 = "username/SouthSudan_dataset"
    tag2 = "username/Ethiopia_dataset"
    tag3 = "username/Senegal_dataset"

    ds1 = Dataset(tag1)
    ds2 = Dataset(tag2)
    ds3 = Dataset(tag3)

    print(f"Dataset {tag1} shape: {ds1['histograms'].compute().shape}")
    print(f"Dataset {tag2} shape: {ds2['histograms'].compute().shape}")
    print(f"Dataset {tag3} shape: {ds3['histograms'].compute().shape}")

    histograms = np.concatenate(
        (
            ds1["histograms"].compute(),
            ds2["histograms"].compute(),
            ds3["histograms"].compute(),
        ),
        axis=0,
    )

    yields_list = np.concatenate(
        (
            ds1["yields"].compute(),
            ds2["yields"].compute(),
            ds3["yields"].compute(),
        ),
        axis=0,
    )

    print(f"Datasets combined, histograms set's shape is {histograms.shape}")
    print(f"Data loaded from {tag1}, {tag2} and {tag3}")

    Training

    When we combined the three datasets with np.concatenate, we used tf.data.Dataset.from_tensor_slices to convert the resulting arrays (histograms and yields_list) into a TensorFlow dataset:

    import tensorflow as tf

    list_ds = tf.data.Dataset.from_tensor_slices((histograms, yields_list))
    image_count = histograms.shape[0]

    But when we worked with only one of the datasets, we directly used Activeloop's built-in TensorFlow conversion, ds.to_tensorflow():

    def to_model_fit(item):
        # Map each Hub sample to an (input, label) pair for Keras.
        x = item["histograms"]
        y = item["yields"]
        return (x, y)

    list_ds = ds1.to_tensorflow()
    list_ds = list_ds.map(to_model_fit)
    image_count = ds1["histograms"].compute().shape[0]

    Here is a nice example of how to use Activeloop datasets directly for training with TensorFlow.

    Then we split the data into train, validation, and test sets using the skip and take functions. Once we had the three sets, we batched, shuffled, and cached them using the TensorFlow functions.

    batch_size = 16

    print("Total files: {}".format(image_count))
    train_size = int(0.8 * image_count)
    val_size = int(0.1 * image_count)
    test_size = int(0.1 * image_count)

    list_ds = list_ds.shuffle(image_count)
    test_ds = list_ds.take(test_size)
    train_ds = list_ds.skip(test_size)
    val_ds = train_ds.take(val_size)    # take the validation split from the remainder,
    train_ds = train_ds.skip(val_size)  # not from list_ds, so the three sets don't overlap

    train_ds = train_ds.shuffle(train_size)
    train_ds = train_ds.batch(batch_size)

    val_ds = val_ds.shuffle(val_size)
    val_ds = val_ds.batch(val_size)     # the whole validation set as a single batch

    test_ds = test_ds.batch(test_size)  # likewise for the test set

    And finally, we trained our CNN model with the usual compile-and-fit calls:

    metrics_list = [
        'accuracy',
        tf.keras.metrics.RootMeanSquaredError(name='RMSE'),
        tf.keras.metrics.MeanSquaredError(name='MSE'),
    ]

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=metrics_list,
    )

    model.fit(train_ds,
              epochs=1,
              validation_data=val_ds,
              verbose=1)
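
    To tie this back to the transfer-learning workflow described earlier, here is a hedged sketch of the two-stage training (pre-train on South Sudan + Ethiopia, then fine-tune on all three countries). build_model is a hypothetical stand-in for the project's CNN constructor, and histograms/yields_list are the combined arrays built above:

    # Hypothetical sketch of the two-stage transfer learning.
    pretrain_x = np.concatenate(
        (ds1["histograms"].compute(), ds2["histograms"].compute()), axis=0)
    pretrain_y = np.concatenate(
        (ds1["yields"].compute(), ds2["yields"].compute()), axis=0)

    model = build_model()  # hypothetical CNN constructor
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.MeanSquaredError())
    model.fit(pretrain_x, pretrain_y, batch_size=16, epochs=10)

    # Fine-tune on the combined three-country arrays, typically with a
    # lower learning rate so the pre-trained weights are not destroyed.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=tf.keras.losses.MeanSquaredError())
    model.fit(histograms, yields_list, batch_size=16, epochs=5)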

    Consistency of path and data

    Another advantage of using Activeloop in this project was that the dataset paths were accessible to all developers without anyone having to store the data locally, and we could be certain that all developers were using the same datasets.

    Update a dataset

    It was also easy to replace a dataset by re-uploading the updated version to the same tag. This was really useful whenever we collected and processed more data and had to update the datasets for training. All training runs were done in Google Colab notebooks, all using Activeloop-stored datasets, so the import step consisted only of loading the Dataset with the code above, which loads all the data at once rather than one file at a time (which might otherwise require a dataloader class or function).
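
    As a sketch, re-uploading to the same tag could look like this under the hub 1.x API; the mode="w+" overwrite mirrors the schema code above, and new_histograms/new_ndvi are hypothetical updated arrays:

    # Hypothetical sketch: recreate the dataset behind the same tag so every
    # collaborator picks up the update without changing any paths.
    ds = Dataset(
        "username/Senegal_dataset",
        shape=(new_histograms.shape[0],),
        schema={
            "histograms": schema.Tensor(new_histograms[0].shape, dtype="float"),
            "ndvi": schema.Tensor(new_ndvi[0].shape, dtype="float"),
            "yields": schema.Tensor(shape=(1,), dtype="float"),
        },
        mode="w+",
    )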

    Discussion

    We could have used Activeloop to store the satellite images, but decided instead to store them in an S3 bucket, so that the raw data would live in the bucket and the preprocessed, ready-to-use datasets in Activeloop. All the preprocessing that led to the histograms was therefore done on a local machine, with the satellite images downloaded locally.

    The way we store the ground truth yield values can probably be improved by using the available schemas more efficiently.

    To conclude, in this project we used Activeloop to store the ML datasets efficiently and easily, with a unique and consistent path for all collaborators.
