Today we are excited to open source Deep Lake PG (@activeloopai/deeplake), the database for AI behind Activeloop’s Scientific Discovery, which manages one of the largest datasets (175TB+) of indexed scientific research. Here are the details:
- Deep Lake PG is the database for AI that provides both fast transactional queries for an agent’s memory store and scalable multimodal analytical queries across hundreds of TBs of indexed data.
- Achieves state-of-the-art cost efficiency on TPC-H compared to serverless data warehouses. Running Deep Lake PG is 1.5x cheaper than Snowflake and up to 3x cheaper than Databricks.
Data Disconnect Blocks the Economic Impact of AI
Despite billions spent on AI, enterprises and research institutions struggle to connect their proprietary data to these powerful models. As Microsoft CEO Satya Nadella recently noted, the “most important database” isn’t just rows and columns; it is the unstructured relationships between documents, chats, and business events.
“the most important database in any company, right, which is underneath your email, your documents, your Teams calls, what have you. It’s the relationships” - Satya Nadella, CEO Microsoft
Clearly, modern data infrastructure, including data warehouses and lakehouses, is failing to solve this problem.
Introducing Deep Lake PG: The Database for AI
Managing 175TB+ of indexed data on Activeloop Scientific Discovery was only possible because of the data infrastructure underneath it. Today, we are releasing that infrastructure to everyone.
Deep Lake PG is the industry’s first unified database for AI agents. It combines a fully managed, serverless Postgres (for transactional state) with Deep Lake’s tensor storage (for multimodal data), all accessible via SQL.
Deep Lake PG eliminates the complexity of the entire lakehouse stack. It is one database with two superpowers:
- Transactional Speed: Fully managed, serverless Postgres for low-latency agent memory and state.
- Lake Scale: Deep Lake storage for multimodal/vector data at the petabyte scale.
Over the past five years, we open sourced Deep Lake with fast streaming to GPUs over the network at 95% utilization, published an academic paper at CIDR ’23 on a Lakehouse for Deep Learning, built everything from lexical to multi-vector indexes backed by S3, and deployed it into production for Fortune 500 enterprises managing data at large scale [1, 2, 3]. We were among the first to ship Text-to-SQL with GPT, Multimodal Deep Research on Your Data, and AI-generated dashboards. Along the way, we experienced the full complexity that comes from building and managing what sits at the core of any organization: its data.
With the launch of Deep Lake PG, this complexity disappears. Deep Lake PG is the industry’s first database that unifies a fully-managed, serverless Postgres with Deep Lake’s powerful engine for multimodal and vector data. The entire AI data stack now lives in one familiar place.
You get low-latency transactional capabilities for your agent’s state, seamlessly integrated with billion-scale vector search for long-term memory and petabyte-scale multimodal analytics ready for model fine-tuning. No more worrying about terms like data lake, lakehouse, data mesh, or data fabric. This means you can build and deploy powerful AI agents without gluing together multiple data platforms, synchronizing databases, or reconciling security policies across systems.
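As a rough sketch of what this looks like in practice, here is a minimal example. The USING deeplake, deeplake_index, and <#> syntax is taken from the quickstart later in this post; the table layout and the toy 3-dimensional vectors are purely illustrative.

-- Short-term agent state lives in a regular Postgres table.
CREATE TABLE agent_sessions (
    session_id  SERIAL PRIMARY KEY,
    agent_id    TEXT NOT NULL,
    state       JSONB,                          -- scratchpad / plan for the current task
    updated_at  TIMESTAMPTZ DEFAULT now()
);

-- Long-term memory lives in a Deep Lake-backed table with a vector index.
CREATE TABLE agent_memory (
    id         SERIAL PRIMARY KEY,
    agent_id   TEXT,
    content    TEXT,
    embedding  float4[]
) USING deeplake;

CREATE INDEX agent_memory_embedding_idx
    ON agent_memory USING deeplake_index (embedding DESC);

-- Update short-term state and recall relevant long-term memories, in one session.
UPDATE agent_sessions
   SET state = '{"step": "retrieve"}'::jsonb, updated_at = now()
 WHERE session_id = 42;

SELECT content, embedding <#> ARRAY[0.12, 0.48, 0.91] AS score
  FROM agent_memory
 WHERE agent_id = 'research-agent-7'
 ORDER BY score DESC
 LIMIT 5;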

Capabilities for the Agentic Future
Modern AI apps aren’t just chatbots. They’re swarms of agents that read, plan, write, and revise. Current data systems struggle under this new reality: thousands of exploratory queries, partial plans, rollbacks, and branched transactions per single user task.
Deep Lake PG simplifies the entire lifecycle of building and operating AI applications by providing a single, coherent platform for all your data needs.
- Multimodal by default: Text, tables, JSON, PDFs, images, and audio are indexed with lexical and vector indexes on the lake (a minimal sketch follows this list).
- One database, two superpowers: Fully managed, serverless Postgres for low‑latency state + Deep Lake for multimodal/vector data at lake scale.
- Scale without Sharding: Deep Lake’s object-storage write consistency lets ephemeral Postgres instances scale horizontally, streaming data to compute over the network on demand to increase throughput, all with a small memory footprint while serving a billion rows.
- Branch & Merge Tables: Fast fork/commit/rollback for safe speculative writes - MVCC on steroids - so agents can explore without breaking prod.
- A Unified API & Security Model: Interact with both your transactional and AI data through a single, easy-to-use API. Manage permissions and governance for all your data in one place, dramatically simplifying security.
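Here is a minimal sketch of what a multimodal table can look like, again borrowing the USING deeplake and deeplake_index syntax from the quickstart below. The column layout, including a plain BYTEA column for raw image bytes, is an illustrative assumption rather than a prescribed schema, and the lexical indexing mentioned above is not shown here.

-- Illustrative multimodal table: text, structured metadata, raw image bytes,
-- and an embedding, all stored on the lake via the deeplake access method.
CREATE TABLE papers (
    id          SERIAL PRIMARY KEY,
    title       TEXT,
    body        TEXT,
    metadata    JSONB,
    figure_png  BYTEA,      -- raw image bytes; an illustrative choice of type
    embedding   float4[]
) USING deeplake;

CREATE INDEX papers_embedding_idx ON papers USING deeplake_index (embedding DESC);

-- Filter on structured metadata and rank by vector similarity in one query.
SELECT id, title, embedding <#> ARRAY[0.1, 0.2, 0.3] AS score
FROM papers
WHERE metadata->>'venue' = 'CIDR'
ORDER BY score DESC
LIMIT 10;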
By combining these, you can build stateful multimodal AI agents that can instantly recall recent interactions, while simultaneously reason over vast knowledge bases stored as vectors and stream to fine-tune models. All without managing separate infrastructure.
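The table-level fork/commit/rollback API behind Branch & Merge Tables is not shown in this post, so as a rough, plain-SQL analogy for the speculative writes described above, here is how an agent can already stage work with standard Postgres transactions and savepoints, reusing the illustrative tables from the earlier sketch. Branch & Merge extends this idea from rows inside a transaction to whole tables at lake scale.

-- Plain-SQL analogy only: transactions and savepoints let an agent try a write,
-- inspect the result, and undo it if the plan does not pan out.
BEGIN;

INSERT INTO agent_memory (agent_id, content, embedding)
VALUES ('research-agent-7', 'tentative conclusion from step 3', ARRAY[0.3, 0.1, 0.9]);

SAVEPOINT speculative_step;

UPDATE agent_sessions
   SET state = '{"step": "verify"}'::jsonb
 WHERE session_id = 42;

-- Verification failed: undo the last step but keep the earlier insert.
ROLLBACK TO SAVEPOINT speculative_step;

COMMIT;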
Simplifies the Technical Debt Introduced by the Lakehouse
A decade ago, you were told to take your data out of Postgres and ETL it into a warehouse. Then we said, no, move it into a data lake. Then bolt on a query engine, and call that a Lakehouse. As the number of tables exploded, you unified them into a catalog and branded it a “semantic layer” to agree on definitions. Now, we’re told to reverse-ETL back into the same Postgres tables, this time to power AI agents.

Layer upon layer. Patch after patch. A thousand little parts, scattered across a fragmented ecosystem (see MAD@25). As if every user, or every AI agent, was expected to build a rocket engine. When all you needed… was a car to drive.
And then, just when you think you’ve reached the end, the marketing machine invents new names: Lakehouse, Data Mesh, Data Fabric. Different labels, same problem: making data simple, reliable, and usable at scale, and now, for AI.
That’s why we built Deep Lake PG.
It consolidates the entire AI data stack into one simple platform, with a SQL syntax that LLMs are already well trained on: the database for AI, on Postgres.

Building sophisticated AI agents today feels like assembling a puzzle with pieces from different boxes. Developers need a fast, transactional database like Postgres for an agent’s short-term memory and state management. Simultaneously, they need a specialized vector database for long-term memory, retrieval-augmented generation (RAG), and analytics over massive unstructured datasets sitting on object storage, and from time to time they need to stream that data to fine-tune models. This forces developers to stitch together fundamentally different systems.
The result is a brittle and complex AI data stack. Engineers waste countless hours building and maintaining fragile data pipelines just to keep the transactional and vector stores in sync. They wrestle with separate security models, networking configurations, and query languages. Every new feature requires updating multiple systems, increasing the risk of errors and slowing down innovation. This constant plumbing distracts from what truly matters: building intelligent, responsive, and reliable AI applications.
The tradeoffs of OLAP vs. OLTP and relational vs. non-relational are becoming outdated for agentic workloads. Agents need both low-latency transactions and large analytical queries, in a SQL dialect that supports multimodal data.
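To make that concrete, here is what the two query shapes look like side by side in the same SQL dialect, reusing the illustrative tables from the sketches above: a point lookup of an agent’s working state next to an analytical aggregation over a Deep Lake table.

-- OLTP shape: low-latency point lookup of an agent's working state.
SELECT state
FROM agent_sessions
WHERE session_id = 42;

-- OLAP shape: analytical aggregation over a lake-scale multimodal table.
SELECT metadata->>'venue'  AS venue,
       count(*)            AS papers,
       avg(length(body))   AS avg_body_chars
FROM papers
GROUP BY venue
ORDER BY papers DESC
LIMIT 20;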
More Cost Efficient
Deep Lake PG achieves state-of-the-art cost efficiency on TPC-H SF100 compared to alternative serverless data warehouses. It is 1.5x cheaper than Snowflake and up to 3x cheaper than Databricks.

Simple to Get Started
docker run -d -e POSTGRES_PASSWORD=postgres -p 5432:5432 \
  quay.io/activeloopai/pg-deeplake:18

> psql -h localhost -p 5432 -U postgres

-- 1. Enable extension
CREATE EXTENSION pg_deeplake;

-- 2. Create a table with Deep Lake storage
CREATE TABLE vectors (
    id SERIAL PRIMARY KEY,
    v1 float4[],
    v2 float4[]
) USING deeplake;

-- 3. Create an index
CREATE INDEX index_for_v1 ON vectors USING deeplake_index (v1 DESC);

-- 4. Insert data
INSERT INTO vectors (v1, v2) VALUES
    (ARRAY[1.0, 2.0, 3.0], ARRAY[1.0, 2.0, 3.0]),
    (ARRAY[4.0, 5.0, 6.0], ARRAY[4.0, 5.0, 6.0]),
    (ARRAY[7.0, 8.0, 9.0], ARRAY[7.0, 8.0, 9.0]);

-- 5. Query with cosine similarity
SELECT id, v1 <#> ARRAY[1.0, 2.0, 3.0] AS score
FROM vectors
ORDER BY score DESC
LIMIT 10;

A Unified Database for Every AI Workload
We believe that the future of AI isn’t just about better models; it’s about giving those models the right memory and access to reality. With Deep Lake PG, you can build stateful, multimodal agents that instantly recall conversations, reason over vast knowledge bases, and continuously learn - all without managing a dozen different data infrastructure tools.

We are also open sourcing the C++ code behind Deep Lake. Build with Deep Lake PG: run it yourself on GitHub.
The agentic era is here. It’s time your database caught up.
What is next?
In the upcoming weeks, we will post technical deep dives on how it works and comprehensive tutorials on advanced Deep Lake PG capabilities such as indexing, storing multimodal data (images, video, audio), and indexing on S3, and we will provide extensive benchmarks against lake-native alternatives.
Want to deploy at your company?
While the open-source release is in its infancy, Deep Lake PG is already deployed in production at Activeloop.
Reach out to us at enterprise@activeloop.ai to deploy.

