Best Data-Centric MLOps Tools in 2023: Optimizing Datasets for Fun and Profit
One of the more interesting trends in machine learning is the outsourcing of model design to specialized companies. Often working closely with the open-source community, these companies implement state-of-the-art models that frequently work well out of the box.
But while models have become increasingly standardized, especially with the emergence of “foundation” models like BERT, datasets have not. This disparity is due to cultural inertia (feature engineering is not something you put on a resume) and a dearth of tooling (there is no equivalent framework, i.e. a “PyTorch for data”).
As practitioners who regularly work with terabyte-scale datasets, we have observed that the choice of model architectures matters less than the underlying data in the era of pre-trained, high-capacity models. That’s why we’re excited by the data-centric approach that’s popularized by Andrew Ng.
Ultimately, we think a shift from “get more data” to “get more out of data” represents the logical next step.
Data needs love, too
In the past decade, the responsibilities of machine learning engineers could be described as model-centric.
To simplify matters, they assumed datasets to be correct in order to focus on model-related tasks such as implementing loss functions, connecting the appropriate layers and optimizing hyperparameters.
However, this approach is inadequate in the era of large, noisy datasets that are often collected with little scientific rigor. While models were scrutinized as black boxes, not enough time was spent on data preparation and feature engineering.
Data-centric AI is feature engineering, but snazzier
If you grew up before 2014, you can think of data-centric AI as a catchier way of saying feature engineering without actually saying feature engineering. Many of its ideas, such as unambiguous labels and sufficient coverage of the input space, should be familiar to anyone serious about machine learning.
And if you didn’t, data-centric AI is a catch-all term for systematically reshaping data (correcting errors, filtering, augmentation, synthetic generation, etc.) to improve model performance and robustness. Alternatively, you can think of it as searching through input space rather than parameter space.
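To make this concrete, here is a toy, purely illustrative sketch of what “reshaping data” can mean in practice: deduplicating, filtering, and correcting labels. All names and the rule-based correction below are hypothetical, not taken from any specific library.

```python
# Toy sketch of "searching through input space": improve the dataset, not the model.
# The cleanup rules here are illustrative stand-ins for a real error-auditing pass.

def clean_dataset(samples):
    """Deduplicate, drop empty inputs, and relabel obvious errors."""
    seen = set()
    cleaned = []
    for text, label in samples:
        key = text.strip().lower()
        if key in seen:            # remove exact duplicates
            continue
        seen.add(key)
        if not key:                # drop empty inputs
            continue
        # Rule-based label correction: a crude stand-in for error auditing.
        if "refund" in key and label != "complaint":
            label = "complaint"
        cleaned.append((key, label))
    return cleaned

raw = [
    ("I want a refund", "praise"),   # mislabeled
    ("I want a refund", "praise"),   # duplicate
    ("Great service!", "praise"),
    ("", "complaint"),               # empty input
]
print(clean_dataset(raw))
# → [('i want a refund', 'complaint'), ('great service!', 'praise')]
```

The model never changes in this loop; only the training data does, yet downstream accuracy can improve substantially.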
Problems, meet data-centric solutions!
There is much more to a successful project than just models and datasets. This is especially true once datasets reach terabyte scale, where they feel qualitatively different and pose a different set of questions:
- How can I find and correct labeling errors?
- How can I query or filter for relevant subsets?
- How can I transform data? If my dataset is modified, how can I track changes?
- How do I move my data to where it needs to go?
Since data does not have anything like gradient descent to move it in the appropriate direction, an entirely new tooling ecosystem is required to systematically get more out of data.
This is where MLOps comes in. Beyond Airflow, a good starting point for most teams is an end-to-end platform such as Google’s TensorFlow Extended (TFX), Netflix’s Metaflow, or Databricks’ MLflow.
From there, teams can bolt on specific solutions. A few of our personal favorites include, in no particular order:
Snorkel AI. Snorkel is a platform for data labeling and annotation through programmatic, weak supervision. Its labeling function concept is not immediately intuitive but is extremely powerful. Originally designed for academic researchers, it has found widespread adoption in enterprises, particularly in situations where there is little labeled data.
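To give a flavor of the labeling function concept, here is a minimal pure-Python sketch. Snorkel’s actual API differs (it lives in `snorkel.labeling`), and its learned label model is far more principled than the naive majority vote used below; this is only an illustration of the idea.

```python
# Labeling functions: small, noisy heuristics that vote on a label or abstain.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def majority_vote(text, lfs):
    """Combine noisy votes; Snorkel instead learns each function's accuracy."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_link, lf_short_message, lf_mentions_prize]
print(majority_vote("Claim your prize at http://spam.example", lfs))  # → 1 (SPAM)
```

Writing a handful of such heuristics and aggregating their votes can label far more data, far faster, than manual annotation alone.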
CVAT. Hands down one of the most frequently used open-source annotation tools among data annotators. CVAT (Computer Vision Annotation Tool) is an open-source platform for data labeling that was spun out of Intel (it was part of the computing giant from 2017 to 2022). Under the hood, AI models optimized with Intel’s OpenVINO Toolkit help streamline annotation tasks. In the summer of 2022, the CVAT team decided to separate from Intel and has continued developing CVAT.ai on its own.
Cleanlab. An open-source package developed by MIT researchers, Cleanlab has gained much attention for identifying label errors in well-known datasets like ImageNet and CIFAR. Its well-designed APIs make it a joy to use.
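The core intuition behind this kind of error detection, sometimes called confident learning, can be sketched in a few lines: flag samples whose given label the model itself finds unlikely. Note that the simple threshold rule below is only a stand-in; Cleanlab’s actual approach (e.g. `cleanlab.filter.find_label_issues`) is considerably more principled.

```python
# Flag samples where the model's (out-of-sample) predicted probability
# of the *given* label is suspiciously low. Threshold is illustrative.

def find_suspect_labels(pred_probs, given_labels, threshold=0.2):
    """Return indices of samples whose given label the model deems unlikely."""
    suspects = []
    for i, (probs, label) in enumerate(zip(pred_probs, given_labels)):
        if probs[label] < threshold:
            suspects.append(i)
    return suspects

# Predicted probabilities for 4 samples over 2 classes.
pred_probs = [
    [0.90, 0.10],  # confidently class 0
    [0.05, 0.95],  # confidently class 1
    [0.10, 0.90],  # labeled 0, but the model says class 1 → suspect
    [0.60, 0.40],
]
given_labels = [0, 1, 0, 0]
print(find_suspect_labels(pred_probs, given_labels))  # → [2]
```

The flagged indices are then candidates for human review rather than automatic relabeling, since the model's suspicion can itself be wrong.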
SuperAnnotate. SuperAnnotate is an end-to-end AI data management platform that grew out of a scalable labeling solution. The tool now supports not only data annotation but also versioning of image, video, and text data. It is one of the most streamlined tools out there for data annotation, with built-in QA and tracking features that help curate datasets much faster (likely why it is a G2 leader in data labeling software).
WhyLabs. With its AI observability platform, WhyLabs claims to be the only ML monitoring tool that doesn’t touch raw data. You can integrate any type of data (structured or unstructured) of any size for healthier models and data. Fun fact: WhyLabs recently raised capital from the chief data-centric AI enabler himself, Andrew Ng (it hardly gets more data-centric than that).
Tecton. Tecton develops feature stores (similar to the open-source Feast). The team is tackling one of the biggest problems in enterprise ML (model inputs are increasingly embeddings, not raw data), which makes Tecton one of the most exciting machine learning tools today.
SageMaker. AWS includes a number of options for moving data in and out of S3 buckets. SageMaker is a powerful tool, but it would benefit from a more seamless way of connecting machine learning datasets to it. You can connect S3 to SageMaker effortlessly with Hub, our open-source solution, or Pipe.
YData. YData provides tools for data quality profiling and improvement as part of its data-centric development platform. YData is best known for two popular open-source libraries: ydata-synthetic, a synthetic data generation package for structured datasets, and pandas-profiling, which helps assess the quality of your data. YData also supports the Data-Centric AI community.
Synthetic Data Vault. An open-source package for generating synthetic datasets without the cost of training GANs. For the time being it is limited to tabular, structured data, which makes it less useful in our discussion of data-centric solutions, but it’s a great tool nonetheless.
Arize AI. With Arize’s ML observability platform, you can troubleshoot your models in production using an entire suite of tools for model validation, versioning, drift detection, data quality checks, and model performance management. Improve your models by performing root cause analysis and visualizing where and why problems emerge (no querying, just click into slices directly!).
Fiddler AI. Fiddler helps data scientists monitor, explain, and analyze their AI products. Explainability is currently one of the biggest open problems in enterprise ML, and Fiddler has one of the most sophisticated teams in this space. Speaking from personal experience, it has become increasingly popular with consumer-facing and e-commerce applications.
Arthur. Arthur provides performance monitoring, algorithmic bias detection, and explainability. In many ways, it feels like SageMaker Clarify. However, since bias detection is a relatively ill-defined problem at the moment, it is too soon to say if their data centric solutions are production-ready.
Algorithmia. Recently acquired by DataRobot, Algorithmia automates deployment on top of existing SDLC and CI/CD processes. In my opinion, they provide the most seamless integration between MLOps and traditional DevOps teams but are geared towards less sophisticated teams.
Deepchecks. Deepchecks is an open-source Python package for, ehm, deep-checking all things machine learning: models, data integrity and quality, distribution mismatches, and more. If you’re struggling with model explainability or performance, give it a try! With just a few lines of code, you get reports based on pre-built test suites.
Galileo. Galileo is a data-centric tool focused on NLP use cases. Its superpower is quickly surfacing errors in your NLP datasets (classification and NER at the time of writing). An interesting feature is a metric called data error potential, a score that helps identify samples in your text dataset that hurt model performance.
Seldon. Seldon tests, monitors and deploys models, and is especially popular with enterprises running on-premise clusters. Their open-source package is one of the most popular packages for deploying models on Kubernetes, despite its steep learning curve. Although they market to data scientists, they seem more appropriate for infrastructure engineers who happen to support ML pipelines.
Pachyderm. Pachyderm applies data version control to data pipelines. Its modular, open design along with a clean integration with Kubernetes means teams can quickly scale data transformation and model development. In our experience, it is a good upgrade for teams that have outgrown Airflow and need something better than “good enough”.
DVC. What does DVC stand for? Data version control, and it is one of the most popular “Git for data” solutions available. When in doubt, it’s usually a safe bet to go with DVC given its active community (100+ code contributors, 100+ documentation contributors, and thousands of users).
Superb AI. Superb AI’s platform sports features such as auto-labeling for datasets and active learning capabilities. The platform powers curation, labeling, and observability, enabling higher impact for data science teams.
DoltHub. Dolt, one of the newest additions to dataset version control, focuses on structured data and is rapidly gaining traction. It also positions itself as a database, representing an interesting mix of two familiar concepts: Git and MySQL.
Neptune AI. Neptune collects model metadata in a single dashboard. Although there are a number of experiment logging tools (such as cnvrg, recently acquired by Intel), Neptune’s intuitive UI and minimal code overhead make it one of my favorite experiment logging dashboards.
Activeloop. Activeloop helps teams train models on petascale datasets. Deep Lake, its open-source project, provides a simple API for creating, storing, and collaborating on datasets of any size. Deep Lake also acts as a vector store for Large Language Model solutions built with LangChain. For deep learning, Deep Lake makes it simple to transform and stream datasets from cloud object stores to popular frameworks such as TensorFlow and PyTorch, reducing both training time and cost.
Collectively, these solutions help teams get the most out of their data in a systematic way.
Where Activeloop fits in
We are building a unified framework for data-centric AI, from preparation and preprocessing to model training. Deep Lake, our data lake for deep learning, makes it easy to reshape large datasets and move data to where it is needed, with minimal code overhead. Deep Lake also enables teams to instantly visualize, query, version-control, and explore their data to build better datasets (as well as materialize and stream them for training later).
By helping engineers work with petascale datasets stored remotely as if they were local, in-core datasets, Deep Lake has helped teams cut ML iteration cycles in half and reduce infrastructure bills by over 30%.
Together with our partners from the AI Infrastructure Alliance, we are developing best practices and architectures for doing large-scale ML in enterprises. To see how we help teams get the most out of their data while minimizing infrastructure costs, we invite you to try out Deep Lake.
Data-Centric AI FAQs
What is data-centric AI?
Data-centric AI emphasizes the importance of high-quality machine learning datasets and managing them, rather than solely relying on advanced model architectures and algorithms. It addresses data-related issues such as noise, class imbalance, and feature engineering, leading to more accurate and reliable ML models.
Data-centric AI vs model-centric AI
The conventional model-centric method focuses on optimizing model architectures and algorithms. In contrast, data-centric AI prioritizes the quality and diversity of training data. This approach enhances performance by systematically refining data, improving labeling, and mitigating biases instead of relying solely on complex models.
Why is data quality vital in data-centric AI?
Data quality is crucial because ML models’ performance depends on the quality of the input data. High-quality data ensures effective generalization to real-world scenarios, reducing biases and inaccuracies and leading to improved performance and more reliable predictions. Finally, sometimes you simply can’t get more data (in medical applications, for instance), so making sure the data you do have is top-notch is of utmost importance.
How can machine learning engineers improve data labeling in a data-centric AI approach?
ML engineers can enhance data labeling by establishing clear labeling guidelines, employing multiple annotators to reduce bias, and using techniques like active learning for iterative label refinement. Regularly reviewing and updating labels based on error analysis further improves training data quality, which in turn yields better model accuracy and helps mitigate bias.
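As an illustration, the simplest active learning strategy, uncertainty sampling, can be sketched in a few lines. The function name and data below are hypothetical, not from any particular library.

```python
# Uncertainty sampling: send the samples the model is least sure about
# to human annotators first, so each label buys maximum information.

def least_confident(pred_probs, k=2):
    """Rank unlabeled samples by the model's top-class confidence, lowest first."""
    confidences = [(max(probs), i) for i, probs in enumerate(pred_probs)]
    confidences.sort()
    return [i for _, i in confidences[:k]]

pred_probs = [
    [0.98, 0.02],  # model is sure → low priority for annotation
    [0.55, 0.45],  # model is unsure → annotate early
    [0.80, 0.20],
    [0.51, 0.49],  # most uncertain → annotate first
]
print(least_confident(pred_probs))  # → [3, 1]
```

In practice this loop is repeated: label the top-k uncertain samples, retrain, re-score, and label again, so annotation effort concentrates where the model needs it most.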