DataChad: an AI App with LangChain & Deep Lake to Chat with Any Data

DataChad: build an app to chat with multiple data sources using LangChain & Deep Lake. Query CSVs, PDFs, URLs, or GitHub repos fast, locally or in the cloud.
Gustav von Zitzewitz · 19 min read · May 17, 2023 · Updated May 31, 2023
Use LangChain, OpenAI GPT, & Deep Lake to Chat with CSVs, PDFs, JSONs, GitHub Repos, URLs, & More


We’ve previously explored chatting with PDFs or understanding GitHub repos with LangChain. Many apps inspired by those use cases are popping up, but DataChad, created by our community member Gustav von Zitzewitz, takes things several steps further: it works both locally and in the cloud, and it allows chatting with multiple data sources of various types (PDFs, Excel sheets, etc.) at the same time.

DataChad is an open-source project that lets users ask questions about any data source by leveraging embeddings, Deep Lake as a vector database, large language models like GPT-3.5-turbo or GPT-4, and LangChain. The data source can be anything from a local file like a PDF or CSV to a website URL, a GitHub repository, or even the path to a directory (scanned recursively if the app is deployed locally). The app now supports Local Mode, where all data is processed locally and no API calls are made. This is made possible by leveraging pre-trained open-source LLMs like GPT4All and by creating Deep Lake-powered embedding storage on the local disk instead of in the Deep Lake cloud.

The app works by uploading any file or entering any path or URL (or pointing to the location of your files using Local Mode). The app then detects and loads the data source into text documents, embeds the text documents using OpenAI embeddings, and stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud. A LangChain retrieval chain is established, comprising an LLM and the embedding database index as a retriever. This chain serves as the context for answering user queries over any data they upload.

    Why Do You Need a Chat With Any Data App?

    DataChad is designed to serve as an indispensable tool for individuals who require swift and precise data querying from any source.

Whether you’re seeking a comprehensive understanding of a complete project or looking for swift answers from a single data source without manually sifting through the material (say, a Wikipedia article, a codebase, or an academic paper you’re cramming), DataChad lets you ask natural language questions and get relevant answers in seconds, without writing complex SQL queries or using other data querying tools.

Finally, the app can be hosted and used from anywhere, as in the demo, or deployed locally to enable querying local directories. Being able to run this type of solution fully locally is essential when you cannot send your data to companies like OpenAI (in that case, you’d need to use an open-source large language model).

    Editorial Note on OpenAI Embeddings

Costs can become a factor with extensive OpenAI API usage. To provide full transparency and control over this critical factor, DataChad displays the app’s token usage and total cost in dollars. To get a feeling for the scale: even prompts using the maximum of 4,096 tokens cost well below a single cent in total.
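As a rough back-of-envelope calculation (assuming gpt-3.5-turbo’s pricing of $0.002 per 1K tokens at the time of writing): a maximal 4,096-token request costs about 4.096 × $0.002 ≈ $0.008, i.e., under a cent.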

    How DataChad Works: Architectural Blueprint for AI-powered Chat with Data App

    OpenAI Embeddings

DataChad uses OpenAI Embeddings to convert text documents into vectors that can be indexed and searched efficiently. OpenAI’s embeddings are instrumental in evaluating the semantic similarity between two or more text fragments, or the relevance of long documents to a concise query, which makes them essential for tasks like search and classification. DataChad uses cosine similarity to measure how closely documents match a question.
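As a minimal sketch of the embedding step (assuming an OPENAI_API_KEY is set in the environment), using the same LangChain wrapper that utils.py imports below:

import os
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
# Each call returns a 1536-dimensional vector (text-embedding-ada-002)
doc_vectors = embeddings.embed_documents(["Deep Lake is a vector database."])
query_vector = embeddings.embed_query("What is Deep Lake?")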

    Vector Database

DataChad uses Deep Lake, the vector database for all AI data, to store the embeddings generated from the text documents. Vector databases are designed to store and search vectors efficiently and are optimized for large-scale datasets. Deep Lake stands out from other vector databases in its multi-modality (i.e., its ability to support multiple data types and store embedding metadata), which is highly relevant if you’re looking to build an all-in-one chat-with-data app like DataChad.
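A sketch of the storage step, mirroring what setup_vector_store in utils.py does later (the dataset path is a placeholder, and embeddings is the wrapper from the previous sketch):

from langchain.schema import Document
from langchain.vectorstores import DeepLake

docs = [Document(page_content="Deep Lake is a vector database for AI data.")]
# Create (or append to) a Deep Lake dataset from embedded documents
vector_store = DeepLake.from_documents(
    docs, embeddings, dataset_path="hub://<your-org>/<dataset-name>"  # placeholder path
)
# Retrieve the most similar chunks for a query
results = vector_store.similarity_search("What is Deep Lake?", k=4)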

    Large Language Models (LLMs)

    DataChad uses large language models like GPT-3.5 Turbo to generate responses to user questions. LLMs are powerful models trained on massive amounts of text data that can generate natural language responses to a wide range of questions.

    LangChain

DataChad uses LangChain to combine the embeddings and LLMs into a single retrieval chain that can be used to answer user questions. LangChain is a powerful framework for integrating natural language processing tools into a single pipeline. Read this ultimate LangChain guide if you want to understand the power of LangChain.
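A minimal sketch of wiring the two together, mirroring what build_chain in utils.py does below (vector_store is assumed from the previous step):

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = ConversationalRetrievalChain.from_llm(
    llm, retriever=vector_store.as_retriever(), chain_type="stuff"
)
result = chain({"question": "What is Deep Lake?", "chat_history": []})
print(result["answer"])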

    Streamlit

DataChad is implemented as a Streamlit app, a quick way to build demo apps in Python. Streamlit takes away the pain of implementing and hosting a UI and lets you focus on the backend work.
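To illustrate the pattern (a toy sketch, not the full app shown below):

import streamlit as st

st.title("DataChad")
prompt = st.text_input("Ask a question about your data")
if prompt:
    # In the real app this would call generate_response() from utils.py
    st.write(f"You asked: {prompt}")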

Factors to Consider as You Build a LangChain & Large Language Model-based App (k arg, chunks, etc.)

The DataChad project is built upon the fusion of two critical natural language processing (NLP) technologies: it leverages the attention mechanism of large language models (LLMs) like GPT-4 through the OpenAI API, and it employs vector similarity for efficient embedding comparison when querying the vector database. This combination allows for robust analysis and retrieval of information from textual data. Let’s delve into the details, focusing on the querying parameters of the vector database within DataChad.

    The Attention Mechanism of LLMs

    DataChad taps into the attention mechanism offered by LLMs, such as GPT-3, using the OpenAI API. This attention mechanism enables the model to weigh the importance of different words or tokens in a text sequence, capturing contextual relationships and semantic nuances. By leveraging LLMs, DataChad benefits from their ability to generate rich and accurate representations of textual data.

    Vector Similarity for Embedding Comparison

    When querying the vector database, DataChad employs vector similarity to compare document embeddings. This technique measures the geometric similarity between embeddings, allowing for the efficient retrieval of similar documents. Vector similarity provides a simple yet effective method for identifying related content in large-scale datasets.
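The cosine metric DataChad uses (note "distance_metric": "cos" in build_chain below) reduces to a one-line formula; a sketch in plain NumPy:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a · b) / (||a|| * ||b||); 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))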

    Parameters for Querying the Vector Database and the LLM

    DataChad’s querying process involves several important parameters that influence the retrieval and analysis of document embeddings. What are those parameters?

    chunk_size

    chunk_size in LangChain-based apps determines the size at which the text is divided into smaller chunks before being embedded. This parameter ensures the efficient processing of large documents and controls the granularity of the resulting embeddings. The DataChad default is 1000.
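As an illustration, using the same splitter utils.py applies below with the DataChad defaults (the input file name is hypothetical):

from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = open("my_document.txt").read()  # hypothetical input file
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_text(long_text)  # each chunk holds at most ~1000 characters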

    fetch_k

    fetch_k in LangChain-based apps specifies the number of documents to pull from the vector database. This parameter determines the scope of the search and influences the relevance of the retrieved documents. The DataChad default is 20.

    k

The k in LangChain-based apps represents the number of most similar embeddings selected to build the context for the LLM prompt in the chain. This parameter affects the contextual understanding and response generation of the LLM when querying the OpenAI API. The DataChad default is 10.

    max_tokens

The max_tokens parameter limits the documents returned from the vector store based on their token count before the context is built to query the LLM. It ensures that DataChad does not run into the LLM’s prompt limit (4,096 tokens for gpt-3.5-turbo). The DataChad default is 3357 (see constants.py below).

    temperature

The temperature parameter controls the randomness of the LLM output. A temperature of 0 makes the response deterministic: the model always returns the same completion (making it significantly less prone to hallucination). Temperatures greater than zero produce increasingly varied completions. The DataChad default is 0.7.

    By carefully tuning these parameters, DataChad optimizes the trade-off between computational efficiency and the quality of results obtained from both the vector database and LLM-based querying. By ticking the Advanced Options checkbox in the app, experienced users can further modify these parameters.
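Concretely, the retrieval parameters end up in the retriever’s search arguments, as in build_chain below (shown here with the DataChad defaults; vector_store is assumed from the earlier sketches):

retriever = vector_store.as_retriever()
retriever.search_kwargs.update(
    {
        "maximal_marginal_relevance": True,  # diversify the fetched candidates
        "distance_metric": "cos",            # cosine similarity
        "fetch_k": 20,  # candidates pulled from Deep Lake
        "k": 10,        # most similar chunks kept for the prompt context
    }
)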

    How to Solve Most Common Issues When Building With LangChain

    The previous section discussed the importance of selecting appropriate parameters for querying the vector database and the language model within the DataChad project. However, despite the default values having been carefully chosen and tested, it is not uncommon to encounter challenges or the desire for further improvement in the overall query experience. In this section, we will address some common issues you may face as you build your app and provide suggested solutions that can help overcome these challenges.

    Issue 1: Running into errors related to the prompt length

Solution: Decrease one or more of k, chunk_size, and max_tokens.

    Issue 2: The answers contain hallucinations or do not match the true data content

    Solution: Decrease the temperature. Set it to 0 for the most conservative answers that are unlikely to deviate from the sources.

    Issue 3: The answers are not relevant enough

Solution: Increase chunk_size, or, if this leads to running into Issue 1, increase k and fetch_k while decreasing chunk_size.

    Practical Guide: Building an All-In-One Chat with Anything App

The code is split into three parts. First, we build the Streamlit app, defined in app.py. The second part, utils.py, contains all processing functionality and API calls. The final part is constants.py, where all project-specific paths, names, and descriptions are defined.

    app.py

import streamlit as st
from streamlit_chat import message

from constants import (
    ACTIVELOOP_HELP,
    APP_NAME,
    AUTHENTICATION_HELP,
    CHUNK_SIZE,
    DEFAULT_DATA_SOURCE,
    ENABLE_ADVANCED_OPTIONS,
    FETCH_K,
    MAX_TOKENS,
    OPENAI_HELP,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    USAGE_HELP,
    K,
)
from utils import (
    advanced_options_form,
    authenticate,
    delete_uploaded_file,
    generate_response,
    logger,
    save_uploaded_file,
    update_chain,
)

# Page options and header
st.set_option("client.showErrorDetails", True)
st.set_page_config(
    page_title=APP_NAME, page_icon=PAGE_ICON, initial_sidebar_state="expanded"
)
st.markdown(
    f"<h1 style='text-align: center;'>{APP_NAME} {PAGE_ICON} <br> I know all about your data!</h1>",
    unsafe_allow_html=True,
)

# Initialise session state variables
# Chat and Data Source
if "past" not in st.session_state:
    st.session_state["past"] = []
if "usage" not in st.session_state:
    st.session_state["usage"] = {}
if "chat_history" not in st.session_state:
    st.session_state["chat_history"] = []
if "generated" not in st.session_state:
    st.session_state["generated"] = []
if "data_source" not in st.session_state:
    st.session_state["data_source"] = DEFAULT_DATA_SOURCE
if "uploaded_file" not in st.session_state:
    st.session_state["uploaded_file"] = None
# Authentication and Credentials
if "auth_ok" not in st.session_state:
    st.session_state["auth_ok"] = False
if "openai_api_key" not in st.session_state:
    st.session_state["openai_api_key"] = None
if "activeloop_token" not in st.session_state:
    st.session_state["activeloop_token"] = None
if "activeloop_org_name" not in st.session_state:
    st.session_state["activeloop_org_name"] = None
# Advanced Options
if "k" not in st.session_state:
    st.session_state["k"] = K
if "fetch_k" not in st.session_state:
    st.session_state["fetch_k"] = FETCH_K
if "chunk_size" not in st.session_state:
    st.session_state["chunk_size"] = CHUNK_SIZE
if "temperature" not in st.session_state:
    st.session_state["temperature"] = TEMPERATURE
if "max_tokens" not in st.session_state:
    st.session_state["max_tokens"] = MAX_TOKENS

# Sidebar with Authentication
# Only start App if authentication is OK
with st.sidebar:
    st.title("Authentication", help=AUTHENTICATION_HELP)
    with st.form("authentication"):
        openai_api_key = st.text_input(
            "OpenAI API Key",
            type="password",
            help=OPENAI_HELP,
            placeholder="This field is mandatory",
        )
        activeloop_token = st.text_input(
            "ActiveLoop Token",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        activeloop_org_name = st.text_input(
            "ActiveLoop Organisation Name",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        submitted = st.form_submit_button("Submit")
        if submitted:
            authenticate(openai_api_key, activeloop_token, activeloop_org_name)

    st.info(f"Learn how it works [here]({REPO_URL})")
    if not st.session_state["auth_ok"]:
        st.stop()

    # Clear button to reset all chat communication
    clear_button = st.button("Clear Conversation", key="clear")

    # Advanced Options
    if ENABLE_ADVANCED_OPTIONS:
        advanced_options_form()

# the chain can only be initialized after authentication is OK
if "chain" not in st.session_state:
    update_chain()

if clear_button:
    # resets all chat history related caches
    st.session_state["past"] = []
    st.session_state["generated"] = []
    st.session_state["chat_history"] = []

# file upload and data source inputs
uploaded_file = st.file_uploader("Upload a file")
data_source = st.text_input(
    "Enter any data source",
    placeholder="Any path or url pointing to a file or directory of files",
)

# generate new chain for new data source / uploaded file
# make sure to do this only once per input / on change
if data_source and data_source != st.session_state["data_source"]:
    logger.info(f"Data source provided: '{data_source}'")
    st.session_state["data_source"] = data_source
    update_chain()

if uploaded_file and uploaded_file != st.session_state["uploaded_file"]:
    logger.info(f"Uploaded file: '{uploaded_file.name}'")
    st.session_state["uploaded_file"] = uploaded_file
    data_source = save_uploaded_file(uploaded_file)
    st.session_state["data_source"] = data_source
    update_chain()
    delete_uploaded_file(uploaded_file)

# container for chat history
response_container = st.container()
# container for text box
container = st.container()

# As streamlit reruns the whole script on each change
# it is necessary to repopulate the chat containers
with container:
    with st.form(key="prompt_input", clear_on_submit=True):
        user_input = st.text_area("You:", key="input", height=100)
        submit_button = st.form_submit_button(label="Send")

    if submit_button and user_input:
        output = generate_response(user_input)
        st.session_state["past"].append(user_input)
        st.session_state["generated"].append(output)

if st.session_state["generated"]:
    with response_container:
        for i in range(len(st.session_state["generated"])):
            message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
            message(st.session_state["generated"][i], key=str(i))

# Usage sidebar with total used tokens and costs
# We put this at the end to be able to show usage starting with the first response
with st.sidebar:
    if st.session_state["usage"]:
        st.divider()
        st.title("Usage", help=USAGE_HELP)
        col1, col2 = st.columns(2)
        col1.metric("Total Tokens", st.session_state["usage"]["total_tokens"])
        col2.metric("Total Costs in $", st.session_state["usage"]["total_cost"])

    utils.py

import logging
import os
import re
import shutil
import sys
from typing import List

import deeplake
import openai
import streamlit as st
from dotenv import load_dotenv
from langchain.callbacks import OpenAICallbackHandler, get_openai_callback
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import (
    CSVLoader,
    DirectoryLoader,
    GitLoader,
    NotebookLoader,
    OnlinePDFLoader,
    PythonLoader,
    TextLoader,
    UnstructuredFileLoader,
    UnstructuredHTMLLoader,
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake, VectorStore
from streamlit.runtime.uploaded_file_manager import UploadedFile

from constants import (
    APP_NAME,
    CHUNK_SIZE,
    DATA_PATH,
    FETCH_K,
    MAX_TOKENS,
    MODEL,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    K,
)

# loads environment variables
load_dotenv()

logger = logging.getLogger(APP_NAME)

def configure_logger(debug: int = 0) -> None:
    # boilerplate code to enable logging in the streamlit app console
    log_level = logging.DEBUG if debug == 1 else logging.INFO
    logger.setLevel(log_level)

    stream_handler = logging.StreamHandler(stream=sys.stdout)
    stream_handler.setLevel(log_level)

    formatter = logging.Formatter("%(message)s")

    stream_handler.setFormatter(formatter)

    logger.addHandler(stream_handler)
    logger.propagate = False

configure_logger(0)

def authenticate(
    openai_api_key: str, activeloop_token: str, activeloop_org_name: str
) -> None:
    # Validate all credentials are set and correct
    # Check for env variables to enable local dev and deployments with shared credentials
    openai_api_key = (
        openai_api_key
        or os.environ.get("OPENAI_API_KEY")
        or st.secrets.get("OPENAI_API_KEY")
    )
    activeloop_token = (
        activeloop_token
        or os.environ.get("ACTIVELOOP_TOKEN")
        or st.secrets.get("ACTIVELOOP_TOKEN")
    )
    activeloop_org_name = (
        activeloop_org_name
        or os.environ.get("ACTIVELOOP_ORG_NAME")
        or st.secrets.get("ACTIVELOOP_ORG_NAME")
    )
    if not (openai_api_key and activeloop_token and activeloop_org_name):
        st.session_state["auth_ok"] = False
        st.error("Credentials neither set nor stored", icon=PAGE_ICON)
        return
    try:
        # Try to access openai and deeplake
        with st.spinner("Authenticating..."):
            openai.api_key = openai_api_key
            openai.Model.list()
            deeplake.exists(
                f"hub://{activeloop_org_name}/DataChad-Authentication-Check",
                token=activeloop_token,
            )
    except Exception as e:
        logger.error(f"Authentication failed with {e}")
        st.session_state["auth_ok"] = False
        st.error("Authentication failed", icon=PAGE_ICON)
        return
    # store credentials in the session state
    st.session_state["auth_ok"] = True
    st.session_state["openai_api_key"] = openai_api_key
    st.session_state["activeloop_token"] = activeloop_token
    st.session_state["activeloop_org_name"] = activeloop_org_name
    logger.info("Authentication successful!")

def advanced_options_form() -> None:
    # Input Form that takes advanced options and rebuilds chain with them
    advanced_options = st.checkbox(
        "Advanced Options", help="Caution! This may break things!"
    )
    if advanced_options:
        with st.form("advanced_options"):
            temperature = st.slider(
                "temperature",
                min_value=0.0,
                max_value=1.0,
                value=TEMPERATURE,
                help="Controls the randomness of the language model output",
            )
            col1, col2 = st.columns(2)
            fetch_k = col1.number_input(
                "k_fetch",
                min_value=1,
                max_value=1000,
                value=FETCH_K,
                help="The number of documents to pull from the vector database",
            )
            k = col2.number_input(
                "k",
                min_value=1,
                max_value=100,
                value=K,
                help="The number of most similar documents to build the context from",
            )
            chunk_size = col1.number_input(
                "chunk_size",
                min_value=1,
                max_value=100000,
                value=CHUNK_SIZE,
                help=(
                    "The size at which the text is divided into smaller chunks "
                    "before being embedded.\n\nChanging this parameter makes re-embedding "
                    "and re-uploading the data to the database necessary "
                ),
            )
            max_tokens = col2.number_input(
                "max_tokens",
                min_value=1,
                max_value=4096,
                value=MAX_TOKENS,
                help="Limits the documents returned from database based on number of tokens",
            )
            applied = st.form_submit_button("Apply")
            if applied:
                st.session_state["k"] = k
                st.session_state["fetch_k"] = fetch_k
                st.session_state["chunk_size"] = chunk_size
                st.session_state["temperature"] = temperature
                st.session_state["max_tokens"] = max_tokens
                update_chain()

def save_uploaded_file(uploaded_file: UploadedFile) -> str:
    # streamlit uploaded files need to be stored locally
    # before being embedded and uploaded to the hub
    if not os.path.exists(DATA_PATH):
        os.makedirs(DATA_PATH)
    file_path = str(DATA_PATH / uploaded_file.name)
    uploaded_file.seek(0)
    file_bytes = uploaded_file.read()
    file = open(file_path, "wb")
    file.write(file_bytes)
    file.close()
    logger.info(f"Saved: {file_path}")
    return file_path

def delete_uploaded_file(uploaded_file: UploadedFile) -> None:
    # cleanup locally stored files
    file_path = DATA_PATH / uploaded_file.name
    if os.path.exists(DATA_PATH):
        os.remove(file_path)
        logger.info(f"Removed: {file_path}")

def handle_load_error(e: str = None) -> None:
    e = e or f"No Loader found for your data source. Consider contributing: {REPO_URL}!"
    error_msg = f"Failed to load {st.session_state['data_source']} with Error:\n{e}"
    st.error(error_msg, icon=PAGE_ICON)
    logger.info(error_msg)
    st.stop()

def load_git(data_source: str, chunk_size: int = CHUNK_SIZE) -> List[Document]:
    # We need to try both common main branches
    # Thank you github for the "master" to "main" switch
    repo_name = data_source.split("/")[-1].split(".")[0]
    repo_path = str(DATA_PATH / repo_name)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0
    )
    branches = ["main", "master"]
    for branch in branches:
        if os.path.exists(repo_path):
            data_source = None
        try:
            docs = GitLoader(repo_path, data_source, branch).load_and_split(
                text_splitter
            )
            break
        except Exception as e:
            logger.info(f"Error loading git: {e}")
    if os.path.exists(repo_path):
        # cleanup repo afterwards
        shutil.rmtree(repo_path)
    try:
        return docs
    except Exception as e:
        handle_load_error()

def load_any_data_source(
    data_source: str, chunk_size: int = CHUNK_SIZE
) -> List[Document]:
    # Ugly thing that decides how to load data
    # It ain't much, but it's honest work
    is_text = data_source.endswith(".txt")
    is_web = data_source.startswith("http")
    is_pdf = data_source.endswith(".pdf")
    is_csv = data_source.endswith("csv")
    is_html = data_source.endswith(".html")
    is_git = data_source.endswith(".git")
    is_notebook = data_source.endswith(".ipynb")
    is_doc = data_source.endswith(".doc")
    is_py = data_source.endswith(".py")
    is_dir = os.path.isdir(data_source)
    is_file = os.path.isfile(data_source)

    loader = None
    if is_dir:
        loader = DirectoryLoader(data_source, recursive=True, silent_errors=True)
    elif is_git:
        return load_git(data_source, chunk_size)
    elif is_web:
        if is_pdf:
            loader = OnlinePDFLoader(data_source)
        else:
            loader = WebBaseLoader(data_source)
    elif is_file:
        if is_text:
            loader = TextLoader(data_source)
        elif is_notebook:
            loader = NotebookLoader(data_source)
        elif is_pdf:
            loader = UnstructuredPDFLoader(data_source)
        elif is_html:
            loader = UnstructuredHTMLLoader(data_source)
        elif is_doc:
            loader = UnstructuredWordDocumentLoader(data_source)
        elif is_csv:
            loader = CSVLoader(data_source, encoding="utf-8")
        elif is_py:
            loader = PythonLoader(data_source)
        else:
            loader = UnstructuredFileLoader(data_source)
    try:
        # Chunk size is a major trade-off parameter to control result accuracy over computation
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=0
        )
        docs = loader.load_and_split(text_splitter)
        logger.info(f"Loaded: {len(docs)} document chunks")
        return docs
    except Exception as e:
        handle_load_error(e if loader else None)

def clean_data_source_string(data_source_string: str) -> str:
    # replace all non-word characters with dashes
    # to get a string that can be used to create a new dataset
    dashed_string = re.sub(r"\W+", "-", data_source_string)
    cleaned_string = re.sub(r"--+", "-", dashed_string).strip("-")
    return cleaned_string

def setup_vector_store(data_source: str, chunk_size: int = CHUNK_SIZE) -> VectorStore:
    # either load existing vector store or upload a new one to the hub
    embeddings = OpenAIEmbeddings(
        disallowed_special=(), openai_api_key=st.session_state["openai_api_key"]
    )
    data_source_name = clean_data_source_string(data_source)
    dataset_path = f"hub://{st.session_state['activeloop_org_name']}/{data_source_name}-{chunk_size}"
    if deeplake.exists(dataset_path, token=st.session_state["activeloop_token"]):
        with st.spinner("Loading vector store..."):
            logger.info(f"Dataset '{dataset_path}' exists -> loading")
            vector_store = DeepLake(
                dataset_path=dataset_path,
                read_only=True,
                embedding_function=embeddings,
                token=st.session_state["activeloop_token"],
            )
    else:
        with st.spinner("Reading, embedding and uploading data to hub..."):
            logger.info(f"Dataset '{dataset_path}' does not exist -> uploading")
            docs = load_any_data_source(data_source, chunk_size)
            vector_store = DeepLake.from_documents(
                docs,
                embeddings,
                dataset_path=dataset_path,
                token=st.session_state["activeloop_token"],
            )
    return vector_store

def build_chain(
    data_source: str,
    k: int = K,
    fetch_k: int = FETCH_K,
    chunk_size: int = CHUNK_SIZE,
    temperature: float = TEMPERATURE,
    max_tokens: int = MAX_TOKENS,
) -> ConversationalRetrievalChain:
    # create the langchain that will be called to generate responses
    vector_store = setup_vector_store(data_source, chunk_size)
    retriever = vector_store.as_retriever()
    # Search params "fetch_k" and "k" define how many documents are pulled from the hub
    # and selected after the document matching to build the context
    # that is fed to the model together with your prompt
    search_kwargs = {
        "maximal_marginal_relevance": True,
        "distance_metric": "cos",
        "fetch_k": fetch_k,
        "k": k,
    }
    retriever.search_kwargs.update(search_kwargs)
    model = ChatOpenAI(
        model_name=MODEL,
        temperature=temperature,
        openai_api_key=st.session_state["openai_api_key"],
    )
    chain = ConversationalRetrievalChain.from_llm(
        model,
        retriever=retriever,
        chain_type="stuff",
        verbose=True,
        # we limit the maximum number of used tokens
        # to prevent running into the model's token limit of 4096
        max_tokens_limit=max_tokens,
    )
    logger.info(f"Data source '{data_source}' is ready to go!")
    return chain

def update_chain() -> None:
    # Build chain with parameters from session state and store it back
    # Also delete chat history to not confuse the bot with old context
    try:
        st.session_state["chain"] = build_chain(
            data_source=st.session_state["data_source"],
            k=st.session_state["k"],
            fetch_k=st.session_state["fetch_k"],
            chunk_size=st.session_state["chunk_size"],
            temperature=st.session_state["temperature"],
            max_tokens=st.session_state["max_tokens"],
        )
        st.session_state["chat_history"] = []
    except Exception as e:
        msg = f"Failed to build chain for data source {st.session_state['data_source']} with error: {e}"
        logger.error(msg)
        st.error(msg, icon=PAGE_ICON)

def update_usage(cb: OpenAICallbackHandler) -> None:
    # Accumulate API call usage via callbacks
    logger.info(f"Usage: {cb}")
    callback_properties = [
        "total_tokens",
        "prompt_tokens",
        "completion_tokens",
        "total_cost",
    ]
    for prop in callback_properties:
        value = getattr(cb, prop, 0)
        st.session_state["usage"].setdefault(prop, 0)
        st.session_state["usage"][prop] += value

def generate_response(prompt: str) -> str:
    # call the chain to generate responses and add them to the chat history
    with st.spinner("Generating response"), get_openai_callback() as cb:
        response = st.session_state["chain"](
            {"question": prompt, "chat_history": st.session_state["chat_history"]}
        )
        update_usage(cb)
    logger.info(f"Response: '{response}'")
    st.session_state["chat_history"].append((prompt, response["answer"]))
    return response["answer"]

    constants.py

from pathlib import Path

APP_NAME = "DataChad"
MODEL = "gpt-3.5-turbo"
PAGE_ICON = "🤖"

K = 10
FETCH_K = 20
CHUNK_SIZE = 1000
TEMPERATURE = 0.7
MAX_TOKENS = 3357
ENABLE_ADVANCED_OPTIONS = True

DATA_PATH = Path.cwd() / "data"
DEFAULT_DATA_SOURCE = "git@github.com:gustavz/DataChad.git"

REPO_URL = "https://github.com/gustavz/DataChad"

AUTHENTICATION_HELP = f"""
Your credentials are only stored in your session state.\n
The keys are neither exposed nor made visible or stored permanently in any way.\n
Feel free to check out [the code base]({REPO_URL}) to validate how things work.
"""

USAGE_HELP = f"""
These are the accumulated OpenAI API usage metrics.\n
The app uses '{MODEL}' for chat and 'text-embedding-ada-002' for embeddings.\n
Learn more about OpenAI's pricing [here](https://openai.com/pricing#language-models)
"""

OPENAI_HELP = """
You can sign-up for OpenAI's API [here](https://openai.com/blog/openai-api).\n
Once you are logged in, you find the API keys [here](https://platform.openai.com/account/api-keys)
"""

ACTIVELOOP_HELP = """
You can create an Activeloop account (including 200GB of free database storage) [here](https://www.activeloop.ai/).\n
Once you are logged in, you find the API token [here](https://app.activeloop.ai/profile/gustavz/apitoken).\n
The organisation name is your username, or you can create new organisations [here](https://app.activeloop.ai/organization/new/create)
"""

    Concluding Remarks: Build your Chat with Data Tool, or Use DataChad

DataChad elevates conversing with CSVs, PDFs, JSONs, GitHub repositories, local paths, or web URLs to a completely new level. If you’ve read this far, consider giving DataChad a try.

By harnessing the power of embeddings, Deep Lake’s vector database for all AI data, large language models (LLMs), and LangChain, DataChad enables users to query any data source with ease. DataChad seamlessly transforms any data into text documents, embeds them using OpenAI embeddings, and stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud. It then creates a LangChain retrieval chain, which serves as the context for generating precise responses to user queries. Whether the task at hand is understanding a complex project or seeking quick answers from a single data source, DataChad allows users to pose natural language questions and receive relevant answers in seconds.

    DataChad - Chat with Any Data FAQs

    How can I deploy a ChatGPT for my Data fully locally?


If your data is sensitive and you would like to keep it local, you can still use DataChad to chat with it: just select Local Mode in the settings.

    Can I deploy chat with data fully on-premise?


Yes. If your enterprise data needs to be fully secure and you’re looking to self-host a “ChatGPT” for your data without giving third-party access, you can deploy DataChad in Local Mode with the serverless Deep Lake vector database. With the help of open-source models like GPT4All, you can run the embedding computation fully locally, without the need to send your data to providers like Anthropic or OpenAI.

Can I use ChatGPT to chat with multiple files at the same time?


Yes, DataChad supports chatting with many files at the same time. You can chat with PDFs, text documents, Word documents, or CSV files all at once.
