    Introducing Deep Lake, the Data Lake for Deep Learning

In our biggest launch yet, we're introducing Deep Lake - the data lake for deep learning. Deep Lake is more than 2x more performant than its previous generation and surpasses all other data loaders. Read on to learn more about Deep Lake's features.
Davit Buniatyan
10 min read · Sep 30, 2022 · Updated Apr 21, 2023
Executive Summary

One in three ML projects fails due to the lack of a solid data foundation. Projects suffer from low-quality data, under-utilized compute resources, and the significant labor overhead required to build and maintain large amounts of data. Traditional data lakes break down data silos for analytical workloads, enable data-driven decision-making, improve operational efficiency, and reduce organizational costs. However, most of these benefits are unavailable for deep learning workloads such as natural language processing (NLP), audio processing, and computer vision across verticals like agriculture, healthcare, multimedia, robotics & automotive, and safety & security. Hence, organizations repeatedly opt to develop in-house systems.


    Deep Lake maintains the benefits of a vanilla data lake and enables you to iterate on your deep learning models 2x faster without teams spending time building complex data infrastructure.

Deep Lake stores complex data, such as images, videos, annotations, embeddings, and tabular data, in the form of tensors and rapidly streams the data over the network to the Tensor Query Language, an in-browser visualization engine, and deep learning frameworks without sacrificing GPU utilization. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake will become the new norm.

    Behind the Scenes at Activeloop

In 2016, before starting the company, I began my Ph.D. research at the Connectomics lab at the Princeton Neuroscience Institute. In just a few years, I witnessed the transition from gigabyte- to terabyte-, then to petabyte-scale datasets needed to achieve super-human accuracy in reconstructing the neural connections inside a brain. Our problem was to figure out how to optimize and cut costs 4-5x by rethinking how data is stored and streamed from storage to compute, which models to use, and how to compile and run them at scale. While the industry moved more slowly, we observed similar patterns repeating themselves on a much larger scale.

We started Activeloop (formerly known as Snark AI) as part of the Y Combinator Summer 2018 batch to enable organizations to deploy deep learning solutions more efficiently. We helped build a large language model for patents for a legal tech startup and streamable data pipelines for a petabyte-scale machine learning use case in AgriTech. Through trial and error and talking to hundreds of companies, we learned that all the awesome databases, data warehouses, and data lakes (joined by lakehouses) are great at analytical workloads, but not as much for deep learning applications. The demand for storing unstructured data such as audio, video, and images has exploded over the years (more than 90% of data is now generated in unstructured form). We knew that building the database for AI, the solution to store this data, was the proper challenge for us.

In 2020, we open-sourced the dataset format called “Hub”, which enabled storing images, videos, and audio as chunked arrays on object storage and connecting them to deep learning frameworks such as PyTorch or TensorFlow. We collaborated with teams from Google AI, Waymo, Oxford University, Yale University, and other deep learning groups to figure out the nuts and bolts of a solid data infrastructure for deep learning applications.

In 2021, the open-source project trended #1 in Python and #2 across all GitHub repositories, and was even named one of the top 10 Python ML packages. As of writing this post, the project has 4.8K stars, 75+ contributors, and 1K+ community members. It is in production at research institutions, startups, and public companies.

    We also released the managed version of Activeloop that lets you visualize datasets, version-control or query them, and stream to deep learning frameworks. Apart from providing access to 125+ machine learning datasets, it enables sharing private datasets and collaboration on building and maintaining datasets across organizations. Of course, I couldn’t be more proud of our small and under-resourced team for achieving results in such a short time, but the industry has been innovating at a staggering speed.

Large Foundational Models Taking the World by Storm

Deep learning has achieved super-human accuracy in applications across industries in just a few years: cancer detection from X-ray images, anatomical reconstruction of human neural cells, playing highly complex games such as Dota or Go, driving cars, folding proteins, having human-like conversations, generating code, and even generating realistic images that took the internet by storm (it took about 40 words to create the perfect prompt, but AI generated the stunning title image of this post). Three factors enable this speed: (1) novel architectures such as Transformers, (2) massive compute capabilities using GPUs or TPUs, and (3) large datasets such as ImageNet, CommonCrawl, and LAION-400M.

At Activeloop, we firmly believe that connecting deep learning models to the value chain over the next five years will produce a foundational shift in the global economy. While innovators have primarily focused on models and computing hardware, maintaining or streamlining the complex data infrastructure has been an afterthought. In the build-versus-buy dilemma, organizations (for the lack of a “buy” option) repeatedly build hard-to-manage in-house solutions. All of this led us to decide on the next chapter for the company - Deep Lake.


    Introduction to Deep Lake, the Data Lake for Deep Learning

What is Deep Lake?


Deep Lake is a vanilla data lake for deep learning, but with one key difference. Deep Lake stores complex data, such as images, audio, videos, annotations, embeddings, and tabular data, in the form of tensors and rapidly streams the data over the network to the Tensor Query Language, an in-browser visualization engine, or deep learning frameworks without sacrificing GPU utilization.

    Deep Lake provides key features that make it the optimal data storage platform for deep learning applications, including:

    • A scalable and efficient data storage system that can handle large amounts of complex data in a columnar fashion
    • Querying and visualization engine to fully support multimodal data types
    • Native integration with deep learning frameworks and efficient streaming of data to models and back
    • Seamless connection with MLOps tools
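To make the chunked, columnar layout concrete, here is a minimal sketch. This is an illustrative toy, not Deep Lake's actual implementation: the `ChunkedColumn` class, the dict standing in for an object store, and the 4-samples-per-chunk size are all invented for the example. The idea it demonstrates is real, though: samples are grouped into chunks, each chunk is one object in storage, and reading a sample fetches only the chunk that contains it.

```python
CHUNK = 4  # samples per chunk; real formats chunk by byte size (e.g. ~16 MB)

class ChunkedColumn:
    """Toy columnar store: samples are grouped into fixed-size chunks,
    each persisted as one object in a key-value store (think S3/GCS)."""

    def __init__(self, store):
        self.store = store   # dict standing in for an object store
        self.length = 0

    def append(self, sample: bytes):
        key = self.length // CHUNK           # which chunk this sample lands in
        self.store.setdefault(key, []).append(sample)
        self.length += 1

    def __getitem__(self, i: int) -> bytes:
        # only the single chunk holding sample i needs to be fetched
        return self.store[i // CHUNK][i % CHUNK]

store = {}
images = ChunkedColumn(store)
for n in range(10):
    images.append(b"jpeg-bytes-%d" % n)

assert len(store) == 3               # 10 samples / 4 per chunk -> 3 objects
assert images[9] == b"jpeg-bytes-9"  # random access touches one chunk
```

Chunking is what makes object storage viable for training: per-sample objects would drown in request latency, while one giant file would force downloading everything to read anything.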

    Machine Learning Loop with Deep Lake

These are the five fundamental pillars of Deep Lake:

    1. Version Control: Git for data
    2. Visualize: In-browser visualization engine
    3. Query: Rapid queries with Tensor Query Language
    4. Materialize: Format native to deep learning
    5. Stream: Streaming Data Loaders
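To give a feel for the first pillar, "Git for data", here is a hedged sketch of content-addressed versioning; the `chunk_id` and `commit` helpers are hypothetical illustrations, not part of the Deep Lake API. The principle is that a version is just a list of chunk hashes, so editing one sample rewrites one chunk while every unchanged chunk is shared between versions:

```python
import hashlib

def chunk_id(data: bytes) -> str:
    # content addressing: identical chunks hash to the same id,
    # so unchanged data is stored once and shared across versions
    return hashlib.sha256(data).hexdigest()[:12]

def commit(chunks, parent=None):
    # a "commit" is the ordered list of chunk ids plus a parent pointer
    return {"chunks": [chunk_id(c) for c in chunks], "parent": parent}

v1 = commit([b"img-000", b"img-001", b"img-002"])
# fixing one bad sample rewrites one chunk; the other two are reused
v2 = commit([b"img-000", b"img-001-fixed", b"img-002"], parent=v1)

reused = set(v1["chunks"]) & set(v2["chunks"])
assert len(reused) == 2      # 2 of 3 chunks shared between versions
assert v2["parent"] is v1    # history forms a chain, as in Git
```

This is why dataset version control can be cheap even at petabyte scale: a new version costs only the chunks that actually changed.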

    We discuss those features in-depth in the Deep Lake White Paper and shed light on how it works in the Academic Paper.

    Deep Lake and the Data Loader Landscape

Data loaders are one of the more significant bottlenecks of machine learning pipelines (Mohan et al., 2020), and we’ve built Deep Lake specifically to resolve the data-to-compute handoff bottleneck.
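A minimal sketch of why streaming loaders relieve this bottleneck: by keeping several fetches in flight, storage latency overlaps with compute instead of stalling it. The functions below are simulations for illustration (sleeps standing in for network I/O and GPU work), not Deep Lake's actual loader:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.01)          # simulated storage/network latency per sample
    return i

def train_step(x):
    time.sleep(0.01)          # simulated forward/backward pass
    return x

def stream(indices, prefetch=4):
    # keep several fetches in flight so the training loop rarely waits on I/O
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        for future in [pool.submit(fetch, i) for i in indices]:
            yield future.result()

start = time.perf_counter()
out = [train_step(x) for x in stream(range(20))]
elapsed = time.perf_counter() - start
# serial fetch-then-train would take ~0.4s; with prefetching the fetch
# latency hides behind compute, so this finishes in roughly ~0.2s
```

A naive loader pays fetch latency plus compute per step; a streaming loader pays roughly whichever of the two is larger, which is what keeps the GPU fed.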

Comparison with AWS & Deep Lake

We are thankful to Ofeidis, Kiedanski, and Tassiulas from the Yale Institute for Network Science, who spent a lot of time producing an independent, extensive survey and benchmark of open-source data loaders. The research concluded that the third major iteration of our product, Deep Lake, is not only 2x faster than the previous version but also superior to other data loaders in various scenarios.


*Comparing the performance of Activeloop Hub, Deep Lake, and WebDataset when loading data from different locations: Local, AWS, and MinIO. (Ofeidis et al., 2022)


*Speed as a function of the number of workers for RANDOM on a single GPU. (Ofeidis et al., 2022)

    The reasoning behind some of Deep Lake’s architectural decisions

Naturally, it took a lot of thinking and iteration cycles to arrive at the way Deep Lake is architected - here are a few of the considerations we’ve had.

Where does Deep Lake fit in MLOps?

As numerous MLOps tools enter the market, it becomes hard for buyers to understand the landscape. We collaborated with the AI Infrastructure Alliance to craft a new MLOps blueprint that provides a clear overview across tools. The blueprint goes bottom-up from infrastructure to human interface, and left-to-right from ingestion to development. In the blueprint, Deep Lake takes on the role of a solid data foundation.


Why did we rename Hub to Deep Lake?

Originally, Hub was a chunked array format that naturally evolved with version control, a streaming engine, and query capabilities. Our broad open-source community - users from companies, startups, and academia - was instrumental in iterating on the product. Increasingly, we found the name too generic a descriptor (or, as one of our team members put it, “everyone has a Hub nowadays”). Often, it would cause confusion with dataset hubs. Internally, we were calling it a “deep lake” (or naming it after the deepest lakes in the world). We were delighted to see people like A. Pinhassi thinking in the same direction. Overnight, calling the tool we’re building “deeplake” instead of “hub” felt just right (although our marketing department wasn’t too thrilled, on account of freshly-ordered swag with the Hub branding).

    pip3 install deeplake
    

    Is there a Deep Lakehouse, and where does it come into place?

The format, including versioning and lineage, is fully open-source. The query, streaming, and visualization engines are built in C++ and are closed-source for the time being. Nonetheless, they are accessible via a Python interface for all users. Being committed to open-source principles, we plan to open-source the high-performance engines as they commoditize.

    Deep Lake Architecture

    Does Deep Lake connect to the Modern Data Stack and MLOps tools?

The Deep Lake Airbyte destination allows ingesting datasets from a vast number of data sources. On the MLOps side, we have been collaborating with W&B, Heartex Label Studio, Sama, CleanLab, AimStack, and Anyscale Ray to provide seamless integrations, which we are going to cover in subsequent blog posts.

    The MLOps Ecosystem and Deep Lake

    What’s next for Deep Lake?

As Deep Lake evolves, we will continuously optimize performance. A custom data sampler and sub-tile queries for constructing complex datasets are planned for the 3.1.0 release, while performant TensorFlow support and ACID transactions are scheduled for the 3.2.0 release (watch our GitHub repo to stay tuned).

We believe that the next step for AI research is to capture text, audio, images, and video with large multi-modal foundational models. Just think about how long it took to get to DALL-E, and how little time it took from that milestone to Stable Diffusion or Make-A-Video by Meta AI. A solid data infrastructure is going to be a necessary condition for delivering those models into consumers’ hands. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake is becoming the new norm.

You can dive right into Deep Lake (yes, we will be making endless water puns) by trying out the Getting Started with Deep Lake Colab, and check out our new C++ dataloader and query engine (Alpha) in this Colab. Join our Slack community or book an introductory call with us if you want to start onboarding immediately.

    Citations

    1. The Future of Deep Learning with Deep Lake. Activeloop.
    2. Hambardzumyan, Sasun, et al. "Deep Lake: a Lakehouse for Deep Learning." arXiv preprint arXiv:2209.10785 (2022).
    3. Pihnasi, A. "Deep Lake — an architectural blueprint for managing Deep Learning data at scale — part I."
    4. Ofeidis, Kiedanski, and Tassiulas. "An overview of the data-loader landscape: comparative analysis." arXiv preprint arXiv:2209.13705 (2022).
    5. Mohan, Jayashree, et al. "Analyzing and mitigating data stalls in DNN training." arXiv preprint arXiv:2007.06775 (2020).
