Connecting Research Data to Intelligence for Faster Scientific Discovery

Today we are excited to release Activeloop’s Scientific Discovery, an intelligent agent built on one of the largest datasets of indexed scientific research. Here are the details:

An indexed dataset of 25M open-access scientific papers, spanning 450M+ pages and 175TB+ in total.
An open-source scientific data agent that achieves a state-of-the-art 48% on Humanity’s Last Exam using tools and the indexed scientific research dataset.

US Needs to Unify All Scientific Data

The White House recently launched the Genesis Mission to unify all datasets for scientific discovery, recognizing that current infrastructure cannot support the AI agents needed to cure diseases or discover new materials.

The Genesis Mission will build an integrated AI platform to harness Federal scientific datasets — the world’s largest collection of such datasets, developed over decades of Federal investments — to train scientific foundation models and create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs.

Having fully indexed data for AI accelerates scientific discovery. It is one of the pillars for unlocking immense economic value using ~~artificial~~ intelligence.

175TB of Scientific Research Data Indexed on Deep Lake

We have successfully indexed 175TB of open-access scientific data, creating one of the world’s largest AI-ready scientific datasets. It is a fully structured, multimodal knowledge base powered by Deep Lake.

Traditional search engines see scientific papers as flat text, mostly just titles and abstracts, and often discard the most critical data stored within papers, such as charts, molecular structures, and mathematical formulas. By utilizing Deep Lake’s tensor-based storage, we have preserved this multimodal context, allowing our AI agents to “read” papers with the same visual and semantic understanding as a human researcher.

Scale: 25 million open-access papers comprising over 450 million pages.
Multimodality: Images, tables, and graphs are indexed alongside text, preserving the relationships between distinct data types (e.g., a chemical structure image linked to its textual description).
Cutoff Date: March 2025
Infrastructure: Built on Deep Lake’s “Index-on-the-Lake” technology, this dataset is stored efficiently on S3 Express, enabling sub-second retrieval of complex multimodal queries without the latency or cost of traditional vector databases.
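To make the idea of linked modalities concrete, here is a minimal sketch of how a page-level record might tie a figure to the text that describes it. The class and field names (`Figure`, `PageRecord`, `image_ref`, `linked_text`) are hypothetical and purely illustrative; they are not Deep Lake's schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Figure:
    """One visual element on a page (hypothetical structure)."""
    kind: str          # e.g. "chart", "table", "molecule", "formula"
    image_ref: str     # illustrative pointer to the stored image
    linked_text: str   # the passage that describes this element


@dataclass
class PageRecord:
    """One indexed page: its text plus the figures that appear on it."""
    paper_id: str
    page_number: int
    text: str
    figures: list[Figure] = field(default_factory=list)


def figures_of_kind(pages: list[PageRecord], kind: str) -> list[tuple[str, int, Figure]]:
    """Return (paper_id, page_number, figure) for every figure of the given kind."""
    return [
        (page.paper_id, page.page_number, fig)
        for page in pages
        for fig in page.figures
        if fig.kind == kind
    ]
```

Keeping the describing text on the figure itself means a query that matches the image can return the passage with it, and the other way around.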

This dataset serves as the foundational “brain” for the L1 Science Data Agent, ensuring it retrieves answers based on ground-truth scientific evidence rather than hallucination. You can run queries over the API or at chat.activeloop.ai/science.

[Figure: Science architecture]

Introducing L1: Science Data Agent

To prove the power of a true AI-native database, we built Activeloop L1, a scientific data agent. It has access to the largest visually indexed scientific dataset in history, with 25 million open-access papers, 450+ million pages, and over 175TB of data. Unlike traditional text mining, L1 “sees” the papers: it indexes charts, molecules, formulas, and tables alongside the text within papers.

Here are the results:

48% on Humanity’s Last Exam, a state-of-the-art result: It outperforms all existing models with tools.
Multimodal Scientific Discovery: It can answer queries that require synthesizing visual data from protein structures with textual analysis from clinical trials.

You can search across not only text, but also charts, tables, and all other multimodal data within articles. This enables AI-powered platforms that accelerate scientific discovery, including drug discovery, materials science, and algorithmic improvements.

Humanity’s Last Exam

[Figure: HLE benchmark]

By equipping the data agent with three tools (Code Interpreter, Web Search, and Scientific Search via the Activeloop API), we achieve state-of-the-art results on the HLE benchmark. The agent achieves 43% accuracy in a single pass and 48% with pass@2, attempting all 2,500 queries, including ones containing images or GIFs. The LLM cost per single iteration is under $1.

While not exactly the same, one might speculate that Deep Think, GPT5 Pro, and Grok Heavy run 8 parallel trajectories simultaneously and then aggregate the final result. That makes their scores approximately comparable to a pass@8 score.
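The difference between single-pass accuracy and pass@2 can be sketched as follows. This is a minimal illustration, assuming pass@k counts a question as solved when any of its first k independent attempts is correct. The function name `pass_at_k` and the input layout are assumptions, and the scoring code in the benchmark repository may differ.

```python
def pass_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts.

    `attempts` maps a question id to the correctness of each attempt, in order.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no questions to score")
    solved = sum(1 for results in attempts.values() if any(results[:k]))
    return solved / len(attempts)


# Toy data: four questions, two attempts each (not real benchmark results).
toy = {
    "q1": [True, True],
    "q2": [False, True],
    "q3": [False, False],
    "q4": [True, False],
}
print(pass_at_k(toy, 1))  # 0.5
print(pass_at_k(toy, 2))  # 0.75
```

Under this definition an extra attempt can only add solved questions, so pass@k never decreases as k grows. That is why a system aggregating several trajectories is more fairly compared with a pass@k score than with a single pass.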

To address recent concerns about benchmark leakage, especially with web search, we blocked access to the HuggingFace website. Furthermore, we used LLMs to analyze all executed traces and identify potential answer leaks. While 4.9% of answers were suspected of leakage, only 0.2% were instances of tool usage (web search and scientific search).
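One simple way to block a site from a web search tool is to filter result URLs by host. The sketch below illustrates that idea only; the blocklist contents and the helper names `is_blocked` and `filter_results` are assumptions, and the post does not describe how the blocking was actually implemented.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the post only states that the HuggingFace website was blocked.
BLOCKED_DOMAINS = {"huggingface.co"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    # urlparse only recognizes the host when the URL has a scheme or starts with "//".
    parsed = urlparse(url if "//" in url else "//" + url)
    host = (parsed.hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def filter_results(urls: list[str]) -> list[str]:
    """Drop search results that point at blocked domains."""
    return [u for u in urls if not is_blocked(u)]
```

Matching on the parsed host rather than on a substring avoids blocking unrelated sites whose names merely contain the blocked domain.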

We are open sourcing the full code to reproduce the benchmarks at https://github.com/activeloopai/hle_with_tools

Wave Form Simulation
The science agent generates code to simulate a wave grid that mimics the provided input image, in order to reverse engineer the parameters (question_id: 66eb4602d4b21d010e93bbb9).

Faster Science Discovery over API

Multimodal scientific research involves using AI to analyze and integrate data from multiple, diverse sources or modalities to gain a more holistic and accurate understanding of diseases and potential treatments. Instead of relying only on the text of a research paper, multimodal models combine information from all modalities simultaneously. This approach mirrors how human experts synthesize knowledge from different sources.

You can try the agent today via our OpenAI-compatible API:

    import os

    from openai import OpenAI

    # Point the OpenAI client at the Activeloop science endpoint.
    client = OpenAI(
        base_url="https://science-api.activeloop.ai/",
        api_key=os.getenv("ACTIVELOOP_TOKEN"),
    )

    response = client.chat.completions.create(
        model="activeloop-l1-retrieval",
        messages=[
            {
                "role": "user",
                "content": "Which compounds show a novel synergistic effect "
                           "with metformin in treating type 2 diabetes?",
            }
        ],
    )

    # Print the agent's answer.
    print(response.choices[0].message.content)

You can get an API key by signing up and subscribing at chat.activeloop.ai, and learn more about usage in the docs.
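For readers not using the OpenAI SDK, the sketch below shows roughly what an OpenAI-style chat request looks like on the wire. It only builds the request; nothing is sent unless `send` is called. The `chat/completions` path and the Bearer token header are assumptions based on the OpenAI convention the SDK follows, not something confirmed here, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://science-api.activeloop.ai/"  # from the example above
MODEL = "activeloop-l1-retrieval"                # from the example above


def build_request(question: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request.

    Assumption: the endpoint path follows the OpenAI convention.
    """
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=BASE_URL + "chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Separating `build_request` from `send` lets you inspect or log the payload before spending a paid call.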

Key Applications for Empowering the Genesis Mission

Its main applications include:

Accelerating Biotechnology & Target Identification: Aligned with the mission to “cure diseases,” multimodal AI correlates diverse data such as gene expression, protein interactions, and clinical outcomes to pinpoint viable drug targets faster than humanly possible.
Critical Materials & Energy Dominance: Essential for “nuclear fission, fusion, and energy dominance.” The agent can explore vast chemical spaces to generate candidate structures for next-gen batteries or superalloys that satisfy conflicting properties like efficacy, safety, and thermal stability.
Semiconductors & Advanced Manufacturing: Supporting the race for “global technology dominance.” By indexing fabrication diagrams and material properties from millions of papers, the agent can suggest process improvements and novel material compositions for microelectronics.

Get Started Today:
Activeloop has already built the data infrastructure needed to handle the scientific data required for faster scientific discovery, including finding cures for diseases and better materials.
Try the Science Agent: chat.activeloop.ai/science