DataChad: an AI App with LangChain & Deep Lake to Chat with Any Data

DataChad: build an app to chat with multiple data sources using LangChain & Deep Lake. Query CSVs, PDFs, URLs, or GitHub repos fast, locally or in the cloud.
Gustav von Zitzewitz · 19 min read · May 17, 2023 · Updated May 31, 2023
Use LangChain, OpenAI GPT, & Deep Lake to Chat with CSVs, PDFs, JSONs, GitHub Repos, URLs, & More


We’ve previously explored chatting with PDFs or understanding GitHub repos with LangChain. Many apps inspired by those use cases are popping up, but DataChad, created by our community member Gustav von Zitzewitz, takes things several steps further: it works both locally and in the cloud, and it allows chatting with multiple data sources of various types (PDFs, Excel sheets, etc.) at the same time.

DataChad is an open-source project that lets users ask questions about any data source by leveraging embeddings, Deep Lake as a vector database, large language models like GPT-3.5-turbo or GPT-4, and LangChain. The data source can be anything from a local file like a PDF or CSV to a website URL, a GitHub repository, or even the path to a directory (scanned recursively if the app is deployed locally). The app now supports Local Mode, where all data is processed locally and no API calls are made. This is made possible by leveraging pre-trained open-source LLMs like GPT4All and by creating Deep Lake-powered embedding storage on the local disk instead of in the Deep Lake cloud.

The app works by uploading any file or entering any path or URL (or pointing to the location of your files using Local Mode). The app then detects and loads the data source into text documents, embeds the text documents using OpenAI embeddings, and stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud. A LangChain retrieval chain is established, comprising an LLM and the embedding database index as a retriever. This chain serves as the context for answering user queries over any data they upload.

    Why Do You Need a Chat With Any Data App?

    DataChad is designed to serve as an indispensable tool for individuals who require swift and precise data querying from any source.

Whether you’re seeking a comprehensive understanding of a complete project or looking for swift answers from a single data source without manually sifting through the material (say, a Wikipedia article, a codebase, or an academic paper you’re cramming), DataChad lets you ask natural language questions and get relevant answers in seconds, without writing complex SQL queries or using other data querying tools.

Finally, the app can be hosted and used from anywhere, as in the demo, or deployed locally to enable querying local directories. Being able to run this type of solution fully locally is essential when you cannot send your data to companies like OpenAI (in that case, you’d need to use an open-source large language model).

    Editorial Note on OpenAI Embeddings

Costs can become a factor with extensive OpenAI API usage. To provide full transparency and control over this critical factor, DataChad displays the app’s token usage and total cost in dollars. To get a feeling for the scale: even prompts using the maximum of 4,096 tokens cost well below a single cent in total.
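As a rough back-of-envelope calculation (assuming gpt-3.5-turbo’s pricing of $0.002 per 1K tokens at the time of writing): a maximal 4,096-token request costs about 4.096 × $0.002 ≈ $0.008, i.e., under a cent.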

    How DataChad Works: Architectural Blueprint for AI-powered Chat with Data App

    OpenAI Embeddings

DataChad uses OpenAI Embeddings to convert text documents into vectors that can be indexed and searched efficiently. OpenAI’s embeddings are instrumental in evaluating the semantic similarity between two or more text fragments, or the relevance of long documents to a concise query, which makes them essential for tasks like search and classification. DataChad uses cosine similarity to measure how closely documents match a question.
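As a minimal sketch of the embedding step (assuming an OPENAI_API_KEY is set in the environment), using the same LangChain wrapper that utils.py imports below:

import os
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
# Each call returns a 1536-dimensional vector (text-embedding-ada-002)
doc_vectors = embeddings.embed_documents(["Deep Lake is a vector database."])
query_vector = embeddings.embed_query("What is Deep Lake?")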

    Vector Database

DataChad uses Deep Lake, the vector database for all AI data, to store the embeddings generated from the text documents. Vector databases are designed to store and search vectors efficiently and are optimized for large-scale datasets. Deep Lake stands out from other vector databases in its multi-modality (i.e., its ability to support multiple data types and store embedding metadata), which is highly relevant if you’re looking to build an all-in-one chat-with-data app like DataChad.
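A sketch of the storage step, mirroring what setup_vector_store in utils.py does later (the dataset path is a placeholder, and embeddings is the wrapper from the previous sketch):

from langchain.schema import Document
from langchain.vectorstores import DeepLake

docs = [Document(page_content="Deep Lake is a vector database for AI data.")]
# Create (or append to) a Deep Lake dataset from embedded documents
vector_store = DeepLake.from_documents(
    docs, embeddings, dataset_path="hub://<your-org>/<dataset-name>"  # placeholder path
)
# Retrieve the most similar chunks for a query
results = vector_store.similarity_search("What is Deep Lake?", k=4)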

    Large Language Models (LLMs)

    DataChad uses large language models like GPT-3.5 Turbo to generate responses to user questions. LLMs are powerful models trained on massive amounts of text data that can generate natural language responses to a wide range of questions.

    LangChain

DataChad uses LangChain to combine the embeddings and LLMs into a single retrieval chain that can be used to answer user questions. LangChain is a powerful framework for integrating natural language processing tools into a single pipeline. Read this ultimate LangChain guide if you want to understand the power of LangChain.
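A minimal sketch of wiring the two together, mirroring what build_chain in utils.py does below (vector_store is assumed from the previous step):

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = ConversationalRetrievalChain.from_llm(
    llm, retriever=vector_store.as_retriever(), chain_type="stuff"
)
result = chain({"question": "What is Deep Lake?", "chat_history": []})
print(result["answer"])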

    Streamlit

DataChad is implemented as a Streamlit app, a quick way to build demo apps in Python. Streamlit takes away the pain of implementing and hosting a UI and lets you focus on the backend work.
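To illustrate the pattern (a toy sketch, not the full app shown below):

import streamlit as st

st.title("DataChad")
prompt = st.text_input("Ask a question about your data")
if prompt:
    # In the real app this would call generate_response() from utils.py
    st.write(f"You asked: {prompt}")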

Factors to Consider as You Build a LangChain & Large Language Model-based App (k arg, chunks, etc.)

The DataChad project is built upon the fusion of two critical natural language processing (NLP) technologies: it leverages the attention mechanism of large language models (LLMs) like GPT-4 through the OpenAI API, and it employs vector similarity for efficient embedding comparison when querying the vector database. This combination allows for robust analysis and retrieval of information from textual data. Let’s delve into the details, focusing on the querying parameters of the vector database within DataChad.

    The Attention Mechanism of LLMs

    DataChad taps into the attention mechanism offered by LLMs, such as GPT-3, using the OpenAI API. This attention mechanism enables the model to weigh the importance of different words or tokens in a text sequence, capturing contextual relationships and semantic nuances. By leveraging LLMs, DataChad benefits from their ability to generate rich and accurate representations of textual data.

    Vector Similarity for Embedding Comparison

    When querying the vector database, DataChad employs vector similarity to compare document embeddings. This technique measures the geometric similarity between embeddings, allowing for the efficient retrieval of similar documents. Vector similarity provides a simple yet effective method for identifying related content in large-scale datasets.
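The cosine metric DataChad uses (note "distance_metric": "cos" in build_chain below) reduces to a one-line formula; a sketch in plain NumPy:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a · b) / (||a|| * ||b||); 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))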

    Parameters for Querying the Vector Database and the LLM

    DataChad’s querying process involves several important parameters that influence the retrieval and analysis of document embeddings. What are those parameters?

    chunk_size

    chunk_size in LangChain-based apps determines the size at which the text is divided into smaller chunks before being embedded. This parameter ensures the efficient processing of large documents and controls the granularity of the resulting embeddings. The DataChad default is 1000.
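As an illustration, using the same splitter utils.py applies below with the DataChad defaults (the input file name is hypothetical):

from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = open("my_document.txt").read()  # hypothetical input file
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_text(long_text)  # each chunk holds at most ~1000 characters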

    fetch_k

    fetch_k in LangChain-based apps specifies the number of documents to pull from the vector database. This parameter determines the scope of the search and influences the relevance of the retrieved documents. The DataChad default is 20.

    k

The k in LangChain-based apps represents the number of most similar embeddings selected to build the context for the LLM prompt in the chain. This parameter affects the contextual understanding and response generation of the LLM when querying the OpenAI API. The DataChad default is 10.

    max_tokens

The max_tokens parameter limits the documents returned from the vector store based on their token count before the context is built to query the LLM. It ensures that DataChad does not run into the LLM’s prompt limit (4,096 tokens for gpt-3.5-turbo). The DataChad default is 3357 (see constants.py below).

    temperature

The temperature parameter controls the randomness of the LLM output. A temperature of 0 makes the response deterministic: the model always returns the same completion (making it significantly less prone to hallucination). Temperatures greater than zero produce increasingly varied completions. The DataChad default is 0.7.

    By carefully tuning these parameters, DataChad optimizes the trade-off between computational efficiency and the quality of results obtained from both the vector database and LLM-based querying. By ticking the Advanced Options checkbox in the app, experienced users can further modify these parameters.
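Concretely, the retrieval parameters end up in the retriever’s search arguments, as in build_chain below (shown here with the DataChad defaults; vector_store is assumed from the earlier sketches):

retriever = vector_store.as_retriever()
retriever.search_kwargs.update(
    {
        "maximal_marginal_relevance": True,  # diversify the fetched candidates
        "distance_metric": "cos",            # cosine similarity
        "fetch_k": 20,  # candidates pulled from Deep Lake
        "k": 10,        # most similar chunks kept for the prompt context
    }
)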

    How to Solve Most Common Issues When Building With LangChain

    The previous section discussed the importance of selecting appropriate parameters for querying the vector database and the language model within the DataChad project. However, despite the default values having been carefully chosen and tested, it is not uncommon to encounter challenges or the desire for further improvement in the overall query experience. In this section, we will address some common issues you may face as you build your app and provide suggested solutions that can help overcome these challenges.

    Issue 1: Running into errors related to the prompt length

Solution: Decrease one or more of k, chunk_size, and max_tokens.

    Issue 2: The answers contain hallucinations or do not match the true data content

    Solution: Decrease the temperature. Set it to 0 for the most conservative answers that are unlikely to deviate from the sources.

    Issue 3: The answers are not relevant enough

Solution: Increase chunk_size, or, if this leads to running into Issue 1, increase k and fetch_k while decreasing chunk_size.

    Practical Guide: Building an All-In-One Chat with Anything App

The code is split into three parts. First, we build the Streamlit app, defined in app.py. The second part, utils.py, contains all processing functionality and API calls. The final part is constants.py, where all project-specific paths, names, and descriptions are defined.

    app.py

import streamlit as st
from streamlit_chat import message

from constants import (
    ACTIVELOOP_HELP,
    APP_NAME,
    AUTHENTICATION_HELP,
    CHUNK_SIZE,
    DEFAULT_DATA_SOURCE,
    ENABLE_ADVANCED_OPTIONS,
    FETCH_K,
    MAX_TOKENS,
    OPENAI_HELP,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    USAGE_HELP,
    K,
)
from utils import (
    advanced_options_form,
    authenticate,
    delete_uploaded_file,
    generate_response,
    logger,
    save_uploaded_file,
    update_chain,
)

# Page options and header
st.set_option("client.showErrorDetails", True)
st.set_page_config(
    page_title=APP_NAME, page_icon=PAGE_ICON, initial_sidebar_state="expanded"
)
st.markdown(
    f"<h1 style='text-align: center;'>{APP_NAME} {PAGE_ICON} <br> I know all about your data!</h1>",
    unsafe_allow_html=True,
)

# Initialise session state variables
# Chat and Data Source
if "past" not in st.session_state:
    st.session_state["past"] = []
if "usage" not in st.session_state:
    st.session_state["usage"] = {}
if "chat_history" not in st.session_state:
    st.session_state["chat_history"] = []
if "generated" not in st.session_state:
    st.session_state["generated"] = []
if "data_source" not in st.session_state:
    st.session_state["data_source"] = DEFAULT_DATA_SOURCE
if "uploaded_file" not in st.session_state:
    st.session_state["uploaded_file"] = None
# Authentication and Credentials
if "auth_ok" not in st.session_state:
    st.session_state["auth_ok"] = False
if "openai_api_key" not in st.session_state:
    st.session_state["openai_api_key"] = None
if "activeloop_token" not in st.session_state:
    st.session_state["activeloop_token"] = None
if "activeloop_org_name" not in st.session_state:
    st.session_state["activeloop_org_name"] = None
# Advanced Options
if "k" not in st.session_state:
    st.session_state["k"] = K
if "fetch_k" not in st.session_state:
    st.session_state["fetch_k"] = FETCH_K
if "chunk_size" not in st.session_state:
    st.session_state["chunk_size"] = CHUNK_SIZE
if "temperature" not in st.session_state:
    st.session_state["temperature"] = TEMPERATURE
if "max_tokens" not in st.session_state:
    st.session_state["max_tokens"] = MAX_TOKENS

# Sidebar with Authentication
# Only start App if authentication is OK
with st.sidebar:
    st.title("Authentication", help=AUTHENTICATION_HELP)
    with st.form("authentication"):
        openai_api_key = st.text_input(
            "OpenAI API Key",
            type="password",
            help=OPENAI_HELP,
            placeholder="This field is mandatory",
        )
        activeloop_token = st.text_input(
            "ActiveLoop Token",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        activeloop_org_name = st.text_input(
            "ActiveLoop Organisation Name",
            type="password",
            help=ACTIVELOOP_HELP,
            placeholder="Optional, using ours if empty",
        )
        submitted = st.form_submit_button("Submit")
        if submitted:
            authenticate(openai_api_key, activeloop_token, activeloop_org_name)

    st.info(f"Learn how it works [here]({REPO_URL})")
    if not st.session_state["auth_ok"]:
        st.stop()

    # Clear button to reset all chat communication
    clear_button = st.button("Clear Conversation", key="clear")

    # Advanced Options
    if ENABLE_ADVANCED_OPTIONS:
        advanced_options_form()

# the chain can only be initialized after authentication is OK
if "chain" not in st.session_state:
    update_chain()

if clear_button:
    # resets all chat history related caches
    st.session_state["past"] = []
    st.session_state["generated"] = []
    st.session_state["chat_history"] = []

# file upload and data source inputs
uploaded_file = st.file_uploader("Upload a file")
data_source = st.text_input(
    "Enter any data source",
    placeholder="Any path or url pointing to a file or directory of files",
)

# generate new chain for new data source / uploaded file
# make sure to do this only once per input / on change
if data_source and data_source != st.session_state["data_source"]:
    logger.info(f"Data source provided: '{data_source}'")
    st.session_state["data_source"] = data_source
    update_chain()

if uploaded_file and uploaded_file != st.session_state["uploaded_file"]:
    logger.info(f"Uploaded file: '{uploaded_file.name}'")
    st.session_state["uploaded_file"] = uploaded_file
    data_source = save_uploaded_file(uploaded_file)
    st.session_state["data_source"] = data_source
    update_chain()
    delete_uploaded_file(uploaded_file)

# container for chat history
response_container = st.container()
# container for text box
container = st.container()

# As streamlit reruns the whole script on each change
# it is necessary to repopulate the chat containers
with container:
    with st.form(key="prompt_input", clear_on_submit=True):
        user_input = st.text_area("You:", key="input", height=100)
        submit_button = st.form_submit_button(label="Send")

    if submit_button and user_input:
        output = generate_response(user_input)
        st.session_state["past"].append(user_input)
        st.session_state["generated"].append(output)

if st.session_state["generated"]:
    with response_container:
        for i in range(len(st.session_state["generated"])):
            message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")
            message(st.session_state["generated"][i], key=str(i))

# Usage sidebar with total used tokens and costs
# We put this at the end to be able to show usage starting with the first response
with st.sidebar:
    if st.session_state["usage"]:
        st.divider()
        st.title("Usage", help=USAGE_HELP)
        col1, col2 = st.columns(2)
        col1.metric("Total Tokens", st.session_state["usage"]["total_tokens"])
        col2.metric("Total Costs in $", st.session_state["usage"]["total_cost"])

    utils.py

import logging
import os
import re
import shutil
import sys
from typing import List

import deeplake
import openai
import streamlit as st
from dotenv import load_dotenv
from langchain.callbacks import OpenAICallbackHandler, get_openai_callback
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import (
    CSVLoader,
    DirectoryLoader,
    GitLoader,
    NotebookLoader,
    OnlinePDFLoader,
    PythonLoader,
    TextLoader,
    UnstructuredFileLoader,
    UnstructuredHTMLLoader,
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake, VectorStore
from streamlit.runtime.uploaded_file_manager import UploadedFile

from constants import (
    APP_NAME,
    CHUNK_SIZE,
    DATA_PATH,
    FETCH_K,
    MAX_TOKENS,
    MODEL,
    PAGE_ICON,
    REPO_URL,
    TEMPERATURE,
    K,
)

# loads environment variables
load_dotenv()

logger = logging.getLogger(APP_NAME)

def configure_logger(debug: int = 0) -> None:
    # boilerplate code to enable logging in the streamlit app console
    log_level = logging.DEBUG if debug == 1 else logging.INFO
    logger.setLevel(log_level)

    stream_handler = logging.StreamHandler(stream=sys.stdout)
    stream_handler.setLevel(log_level)

    formatter = logging.Formatter("%(message)s")

    stream_handler.setFormatter(formatter)

    logger.addHandler(stream_handler)
    logger.propagate = False

configure_logger(0)

def authenticate(
    openai_api_key: str, activeloop_token: str, activeloop_org_name: str
) -> None:
    # Validate all credentials are set and correct
    # Check for env variables to enable local dev and deployments with shared credentials
    openai_api_key = (
        openai_api_key
        or os.environ.get("OPENAI_API_KEY")
        or st.secrets.get("OPENAI_API_KEY")
    )
    activeloop_token = (
        activeloop_token
        or os.environ.get("ACTIVELOOP_TOKEN")
        or st.secrets.get("ACTIVELOOP_TOKEN")
    )
    activeloop_org_name = (
        activeloop_org_name
        or os.environ.get("ACTIVELOOP_ORG_NAME")
        or st.secrets.get("ACTIVELOOP_ORG_NAME")
    )
    if not (openai_api_key and activeloop_token and activeloop_org_name):
        st.session_state["auth_ok"] = False
        st.error("Credentials neither set nor stored", icon=PAGE_ICON)
        return
    try:
        # Try to access openai and deeplake
        with st.spinner("Authenticating..."):
            openai.api_key = openai_api_key
            openai.Model.list()
            deeplake.exists(
                f"hub://{activeloop_org_name}/DataChad-Authentication-Check",
                token=activeloop_token,
            )
    except Exception as e:
        logger.error(f"Authentication failed with {e}")
        st.session_state["auth_ok"] = False
        st.error("Authentication failed", icon=PAGE_ICON)
        return
    # store credentials in the session state
    st.session_state["auth_ok"] = True
    st.session_state["openai_api_key"] = openai_api_key
    st.session_state["activeloop_token"] = activeloop_token
    st.session_state["activeloop_org_name"] = activeloop_org_name
    logger.info("Authentication successful!")

def advanced_options_form() -> None:
    # Input Form that takes advanced options and rebuilds chain with them
    advanced_options = st.checkbox(
        "Advanced Options", help="Caution! This may break things!"
    )
    if advanced_options:
        with st.form("advanced_options"):
            temperature = st.slider(
                "temperature",
                min_value=0.0,
                max_value=1.0,
                value=TEMPERATURE,
                help="Controls the randomness of the language model output",
            )
            col1, col2 = st.columns(2)
            fetch_k = col1.number_input(
                "k_fetch",
                min_value=1,
                max_value=1000,
                value=FETCH_K,
                help="The number of documents to pull from the vector database",
            )
            k = col2.number_input(
                "k",
                min_value=1,
                max_value=100,
                value=K,
                help="The number of most similar documents to build the context from",
            )
            chunk_size = col1.number_input(
                "chunk_size",
                min_value=1,
                max_value=100000,
                value=CHUNK_SIZE,
                help=(
                    "The size at which the text is divided into smaller chunks "
                    "before being embedded.\n\nChanging this parameter makes re-embedding "
                    "and re-uploading the data to the database necessary "
                ),
            )
            max_tokens = col2.number_input(
                "max_tokens",
                min_value=1,
                max_value=4096,
                value=MAX_TOKENS,
                help="Limits the documents returned from database based on number of tokens",
            )
            applied = st.form_submit_button("Apply")
            if applied:
                st.session_state["k"] = k
                st.session_state["fetch_k"] = fetch_k
                st.session_state["chunk_size"] = chunk_size
                st.session_state["temperature"] = temperature
                st.session_state["max_tokens"] = max_tokens
                update_chain()

def save_uploaded_file(uploaded_file: UploadedFile) -> str:
    # streamlit uploaded files need to be stored locally
    # before being embedded and uploaded to the hub
    if not os.path.exists(DATA_PATH):
        os.makedirs(DATA_PATH)
    file_path = str(DATA_PATH / uploaded_file.name)
    uploaded_file.seek(0)
    file_bytes = uploaded_file.read()
    file = open(file_path, "wb")
    file.write(file_bytes)
    file.close()
    logger.info(f"Saved: {file_path}")
    return file_path

def delete_uploaded_file(uploaded_file: UploadedFile) -> None:
    # cleanup locally stored files
    file_path = DATA_PATH / uploaded_file.name
    if os.path.exists(DATA_PATH):
        os.remove(file_path)
        logger.info(f"Removed: {file_path}")

def handle_load_error(e: str = None) -> None:
    e = e or f"No Loader found for your data source. Consider contributing: {REPO_URL}!"
    error_msg = f"Failed to load {st.session_state['data_source']} with Error:\n{e}"
    st.error(error_msg, icon=PAGE_ICON)
    logger.info(error_msg)
    st.stop()

def load_git(data_source: str, chunk_size: int = CHUNK_SIZE) -> List[Document]:
    # We need to try both common main branches
    # Thank you github for the "master" to "main" switch
    repo_name = data_source.split("/")[-1].split(".")[0]
    repo_path = str(DATA_PATH / repo_name)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0
    )
    branches = ["main", "master"]
    for branch in branches:
        if os.path.exists(repo_path):
            data_source = None
        try:
            docs = GitLoader(repo_path, data_source, branch).load_and_split(
                text_splitter
            )
            break
        except Exception as e:
            logger.info(f"Error loading git: {e}")
    if os.path.exists(repo_path):
        # cleanup repo afterwards
        shutil.rmtree(repo_path)
    try:
        return docs
    except Exception as e:
        handle_load_error()

def load_any_data_source(
    data_source: str, chunk_size: int = CHUNK_SIZE
) -> List[Document]:
    # Ugly thing that decides how to load data
    # It ain't much, but it's honest work
    is_text = data_source.endswith(".txt")
    is_web = data_source.startswith("http")
    is_pdf = data_source.endswith(".pdf")
    is_csv = data_source.endswith("csv")
    is_html = data_source.endswith(".html")
    is_git = data_source.endswith(".git")
    is_notebook = data_source.endswith(".ipynb")
    is_doc = data_source.endswith(".doc")
    is_py = data_source.endswith(".py")
    is_dir = os.path.isdir(data_source)
    is_file = os.path.isfile(data_source)

    loader = None
    if is_dir:
        loader = DirectoryLoader(data_source, recursive=True, silent_errors=True)
    elif is_git:
        return load_git(data_source, chunk_size)
    elif is_web:
        if is_pdf:
            loader = OnlinePDFLoader(data_source)
        else:
            loader = WebBaseLoader(data_source)
    elif is_file:
        if is_text:
            loader = TextLoader(data_source)
        elif is_notebook:
            loader = NotebookLoader(data_source)
        elif is_pdf:
            loader = UnstructuredPDFLoader(data_source)
        elif is_html:
            loader = UnstructuredHTMLLoader(data_source)
        elif is_doc:
            loader = UnstructuredWordDocumentLoader(data_source)
        elif is_csv:
            loader = CSVLoader(data_source, encoding="utf-8")
        elif is_py:
            loader = PythonLoader(data_source)
        else:
            loader = UnstructuredFileLoader(data_source)
    try:
        # Chunk size is a major trade-off parameter to control result accuracy over computation
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=0
        )
        docs = loader.load_and_split(text_splitter)
        logger.info(f"Loaded: {len(docs)} document chunks")
        return docs
    except Exception as e:
        handle_load_error(e if loader else None)

def clean_data_source_string(data_source_string: str) -> str:
    # replace all non-word characters with dashes
    # to get a string that can be used to create a new dataset
    dashed_string = re.sub(r"\W+", "-", data_source_string)
    cleaned_string = re.sub(r"--+", "-", dashed_string).strip("-")
    return cleaned_string

def setup_vector_store(data_source: str, chunk_size: int = CHUNK_SIZE) -> VectorStore:
    # either load existing vector store or upload a new one to the hub
    embeddings = OpenAIEmbeddings(
        disallowed_special=(), openai_api_key=st.session_state["openai_api_key"]
    )
    data_source_name = clean_data_source_string(data_source)
    dataset_path = f"hub://{st.session_state['activeloop_org_name']}/{data_source_name}-{chunk_size}"
    if deeplake.exists(dataset_path, token=st.session_state["activeloop_token"]):
        with st.spinner("Loading vector store..."):
            logger.info(f"Dataset '{dataset_path}' exists -> loading")
            vector_store = DeepLake(
                dataset_path=dataset_path,
                read_only=True,
                embedding_function=embeddings,
                token=st.session_state["activeloop_token"],
            )
    else:
        with st.spinner("Reading, embedding and uploading data to hub..."):
            logger.info(f"Dataset '{dataset_path}' does not exist -> uploading")
            docs = load_any_data_source(data_source, chunk_size)
            vector_store = DeepLake.from_documents(
                docs,
                embeddings,
                dataset_path=dataset_path,
                token=st.session_state["activeloop_token"],
            )
    return vector_store

def build_chain(
    data_source: str,
    k: int = K,
    fetch_k: int = FETCH_K,
    chunk_size: int = CHUNK_SIZE,
    temperature: float = TEMPERATURE,
    max_tokens: int = MAX_TOKENS,
) -> ConversationalRetrievalChain:
    # create the langchain that will be called to generate responses
    vector_store = setup_vector_store(data_source, chunk_size)
    retriever = vector_store.as_retriever()
    # Search params "fetch_k" and "k" define how many documents are pulled from the hub
    # and selected after the document matching to build the context
    # that is fed to the model together with your prompt
    search_kwargs = {
        "maximal_marginal_relevance": True,
        "distance_metric": "cos",
        "fetch_k": fetch_k,
        "k": k,
    }
    retriever.search_kwargs.update(search_kwargs)
    model = ChatOpenAI(
        model_name=MODEL,
        temperature=temperature,
        openai_api_key=st.session_state["openai_api_key"],
    )
    chain = ConversationalRetrievalChain.from_llm(
        model,
        retriever=retriever,
        chain_type="stuff",
        verbose=True,
        # we limit the maximum number of used tokens
        # to prevent running into the model's token limit of 4096
        max_tokens_limit=max_tokens,
    )
    logger.info(f"Data source '{data_source}' is ready to go!")
    return chain

def update_chain() -> None:
    # Build chain with parameters from session state and store it back
    # Also delete chat history to not confuse the bot with old context
    try:
        st.session_state["chain"] = build_chain(
            data_source=st.session_state["data_source"],
            k=st.session_state["k"],
            fetch_k=st.session_state["fetch_k"],
            chunk_size=st.session_state["chunk_size"],
            temperature=st.session_state["temperature"],
            max_tokens=st.session_state["max_tokens"],
        )
        st.session_state["chat_history"] = []
    except Exception as e:
        msg = f"Failed to build chain for data source {st.session_state['data_source']} with error: {e}"
        logger.error(msg)
        st.error(msg, icon=PAGE_ICON)

def update_usage(cb: OpenAICallbackHandler) -> None:
    # Accumulate API call usage via callbacks
    logger.info(f"Usage: {cb}")
    callback_properties = [
        "total_tokens",
        "prompt_tokens",
        "completion_tokens",
        "total_cost",
    ]
    for prop in callback_properties:
        value = getattr(cb, prop, 0)
        st.session_state["usage"].setdefault(prop, 0)
        st.session_state["usage"][prop] += value

def generate_response(prompt: str) -> str:
    # call the chain to generate responses and add them to the chat history
    with st.spinner("Generating response"), get_openai_callback() as cb:
        response = st.session_state["chain"](
            {"question": prompt, "chat_history": st.session_state["chat_history"]}
        )
        update_usage(cb)
    logger.info(f"Response: '{response}'")
    st.session_state["chat_history"].append((prompt, response["answer"]))
    return response["answer"]

    constants.py

from pathlib import Path

APP_NAME = "DataChad"
MODEL = "gpt-3.5-turbo"
PAGE_ICON = "🤖"

K = 10
FETCH_K = 20
CHUNK_SIZE = 1000
TEMPERATURE = 0.7
MAX_TOKENS = 3357
ENABLE_ADVANCED_OPTIONS = True

DATA_PATH = Path.cwd() / "data"
DEFAULT_DATA_SOURCE = "git@github.com:gustavz/DataChad.git"

REPO_URL = "https://github.com/gustavz/DataChad"

AUTHENTICATION_HELP = f"""
Your credentials are only stored in your session state.\n
The keys are neither exposed nor made visible or stored permanently in any way.\n
Feel free to check out [the code base]({REPO_URL}) to validate how things work.
"""

USAGE_HELP = f"""
These are the accumulated OpenAI API usage metrics.\n
The app uses '{MODEL}' for chat and 'text-embedding-ada-002' for embeddings.\n
Learn more about OpenAI's pricing [here](https://openai.com/pricing#language-models)
"""

OPENAI_HELP = """
You can sign-up for OpenAI's API [here](https://openai.com/blog/openai-api).\n
Once you are logged in, you find the API keys [here](https://platform.openai.com/account/api-keys)
"""

ACTIVELOOP_HELP = """
You can create an Activeloop account (including 200GB of free database storage) [here](https://www.activeloop.ai/).\n
Once you are logged in, you find the API token [here](https://app.activeloop.ai/profile/gustavz/apitoken).\n
The organisation name is your username, or you can create new organisations [here](https://app.activeloop.ai/organization/new/create)
"""

    Concluding Remarks: Build your Chat with Data Tool, or Use DataChad

DataChad elevates conversing with CSVs, PDFs, JSONs, GitHub repositories, local paths, or web URLs to a completely new level. If you’ve read this far, consider giving DataChad a try.

By harnessing the power of embeddings, Deep Lake’s vector database for all AI data, large language models (LLMs), and LangChain, DataChad enables users to query any data source with ease. DataChad seamlessly transforms any data into text documents, embeds them using OpenAI embeddings, and stores the embeddings as a vector dataset in Activeloop’s Deep Lake Cloud. It then creates a LangChain retrieval chain, which serves as the context for generating precise responses to user queries. Whether the task at hand is understanding a complex project or seeking quick answers from a single data source, DataChad allows users to pose natural language questions and receive relevant answers in seconds.

    DataChad - Chat with Any Data FAQs

    How can I deploy a ChatGPT for my Data fully locally?


If your data is sensitive and you would like to keep it local, you can still use DataChad to chat with it: just select Local Mode in the settings.

    Can I deploy chat with data fully on-premise?


Yes. If your enterprise data needs to be fully secure and you’re looking to self-host a “ChatGPT” for your data without giving third-party access, you can deploy DataChad in Local Mode with the serverless Deep Lake vector database. With the help of open-source models like GPT4All, you can run the embedding computation fully locally, without the need to send your data to providers like Anthropic or OpenAI.

Can I use ChatGPT to chat with multiple files at the same time?


Yes, DataChad supports chatting with many files at the same time. You can chat with PDFs, text documents, Word documents, or CSV files all at once.
