Suppose you have text from an application that you want to use to build an automated helpdesk chatbot. This is readily accomplished with a text-rich data source, a data integration solution and a scripting language. We have put together an example based on Zendesk data hosted in an S3 data lake. It is written in Python, uses OpenAI’s GPT-4 as the foundation model and ChromaDB as the vector database, and makes extensive use of the LangChain API to connect these pieces together.
Following the RAG architecture described above, the first leg of this workflow is quite simple: set up a connector from your source – in this case, Zendesk – to your destination using Fivetran.
As discussed in this blog post, a simple start to Fivetran consists of no more than a few steps:
- Sign up for Fivetran
- Set up your Zendesk connector using your credentials. Fivetran offers connectors for more than 500 common SaaS, database, event streaming and file storage platforms
- Set up and assign a destination, either a data lake or data warehouse
- Set a schedule and begin an initial sync
[CTA_MODULE]
The next legs of the process involve some engineering work. You will need to accomplish the following steps:
- Extract, transform and vectorize the relevant data from your data repository, loading it into the vector database. You can do this once, persist the results and sync again after enough of the underlying data has changed.
- Set up a user interface that can accept prompts and search for the relevant context from the vector database. Some notable examples of vector databases include ChromaDB, Pinecone and Weaviate.
- Have the retrieval model send the augmented prompts to the foundation model and generate a response.
The complete implementation of the project is available in this GitHub repository. After completing this tutorial, you will have built a retrieval model that includes the following files.
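Based on the steps below, the project consists of a handful of files:

- requirements.txt – the Python dependencies for the project
- query.sql – the Athena query that extracts and transforms the raw Zendesk data
- load_vec_db.py – vectorizes the transformed data and loads it into the vector database
- bot.py – the retrieval model and the Streamlit front end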
The Python scripts sync the data to a vector database and contain all the working parts of the retrieval model, while the SQL script instructs the Athena query engine to extract and transform data from the data lake.
1. Prerequisites
To build the retrieval model, start by setting up your development environment.
1. From the command line, create and/or navigate to the directory where you want to build the project
2. Make sure you have Python version 3.9 or above installed
3. Set up a Python virtual environment for this project and activate it
4. To install the necessary libraries, create a file named requirements.txt and add the following requirements to this file.
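A minimal requirements.txt might look like the following. The exact package set and versions are assumptions based on the libraries used in this tutorial; pin versions as appropriate for your environment.

```
langchain
langchain-community
langchain-openai
openai
chromadb
pyathena
streamlit
```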
After this, run the following pip command to install these libraries. Ensure that you are still inside the Python virtual environment when doing so.
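Assuming requirements.txt sits in the project root, the command is:

```shell
pip install -r requirements.txt
```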
We can now move on to implementing the retrieval model. We will rely heavily on libraries from LangChain. Besides providing access to the foundation model, the LangChain API abstracts away chores such as extracting data from the data lake using a query engine, instantiating and accessing the vector database and properly formatting the inputs and outputs for the foundation model.
2. Extracting and transforming the data
In the root of the project directory, create a file named query.sql. This file will contain the SQL query to extract raw Zendesk data from your data lake and transform the one-to-many relationships between entities and comments into one-to-one relationships. These one-to-many relationships are a recurring problem for many structured chat data models that store large blocks of text.
We begin with a common table expression (CTE) starting with “WITH CommentRank AS (...”:
The window function “ROW_NUMBER() OVER(PARTITION BY…” collects and sorts all comments on a ticket from oldest to newest, assigning a ranking. Our goal here is to massage a one-ticket-to-many-comments relationship to make the comments easier to combine.
This CTE also concatenates each comment with associated fields that contain important details and context, like the names of users, enabling you to ask the chatbot questions about specific participants. You will produce outputs looking roughly like this:
The subsequent SELECT statement pulls the comments from our CTE and uses the LISTAGG and WITHIN GROUP functions to combine these blocks of text based on the ticket’s unique identifier. The result of this query is a table of one-to-one textual records.
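Putting these pieces together, query.sql might look roughly like the following sketch. The schema, table and column names (`zendesk.ticket_comment`, `zendesk."user"` and their fields) are assumptions; adjust them to match your own Zendesk schema as Fivetran lands it in Athena.

```sql
-- Sketch only: table and column names below are assumptions.
WITH CommentRank AS (
    SELECT
        tc.ticket_id,
        -- Rank each ticket's comments from oldest to newest
        ROW_NUMBER() OVER (
            PARTITION BY tc.ticket_id
            ORDER BY tc.created ASC
        ) AS comment_rank,
        -- Concatenate each comment with context such as the author's name
        'On ' || CAST(tc.created AS VARCHAR) || ', ' || u.name
            || ' wrote: ' || tc.body AS comment_text
    FROM zendesk.ticket_comment AS tc
    JOIN zendesk."user" AS u
        ON tc.user_id = u.id
)
SELECT
    ticket_id,
    -- Collapse the many comments into one text block per ticket
    LISTAGG(comment_text, ' ') WITHIN GROUP (ORDER BY comment_rank)
        AS ticket_history
FROM CommentRank
GROUP BY ticket_id
```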
Below is an illustration of what the query accomplishes:
3. Building the retrieval model
1. Loading the vector DB
The first step in building our retrieval model is to extract the transformed Zendesk data from the data lake, vectorize it and then load it into a vector database. To accomplish this, create a file named load_vec_db.py with the following contents:
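A sketch of load_vec_db.py is shown below. The Athena connection details, S3 staging path and the post-0.1 LangChain import paths are assumptions to adapt to your environment; the heavy work only runs when credentials are configured, so the helpers stay importable on their own.

```python
# load_vec_db.py -- extract transformed Zendesk data from the data lake,
# vectorize it and load it into a Chroma vector database. A sketch:
# S3 paths, region and import paths below are assumptions.
import os


def read_query(path="query.sql"):
    """Load the extraction query written earlier in this tutorial."""
    with open(path) as f:
        return f.read()


def rows_to_texts(rows):
    """Split (ticket_id, ticket_history) rows into texts and per-row metadata."""
    texts = [history for _, history in rows]
    metadatas = [{"ticket_id": ticket_id} for ticket_id, _ in rows]
    return texts, metadatas


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    # Imported here so the helpers above work without AWS/OpenAI credentials.
    from pyathena import connect
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import Chroma

    cursor = connect(
        s3_staging_dir="s3://your-athena-results-bucket/",  # assumption
        region_name="us-east-1",                            # assumption
    ).cursor()
    rows = cursor.execute(read_query()).fetchall()

    texts, metadatas = rows_to_texts(rows)
    # Embed the ticket histories and persist them to a local Chroma store
    Chroma.from_texts(
        texts,
        OpenAIEmbeddings(),  # reads the OPENAI_API_KEY environment variable
        metadatas=metadatas,
        persist_directory="chroma_db",
    )
```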
You can then execute this script by running it from the command line:
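Assuming the file is named as above:

```shell
python load_vec_db.py
```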
After the first load, you can call it from the command line at any time to update the vector database whenever data in the data lake has changed enough to warrant an update. Note that this ETL is not incremental and may take some time to execute.
2. Implementing RAG
After your Zendesk data has been loaded into the vector database, you can search for information within it and perform RAG operations on your data. Create a file named bot.py, where you will implement these operations. First, declare a foundation model and create a prompt that this model can understand.
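This step might look like the following sketch. The prompt wording is a hypothetical example to tune for your own support data, and the import path assumes the post-0.1 LangChain package split.

```python
# bot.py -- declare the foundation model and a prompt it can understand.
import os

# Hypothetical prompt wording; adjust for your own data and tone.
PROMPT_TEMPLATE = """You are a helpdesk assistant. Answer the question using only
the support-ticket context provided. If the answer is not in the context, say
that you do not know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(context, question):
    """Fill the template with retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4", temperature=0)  # reads OPENAI_API_KEY
```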
Once the prompt is declared, implement the vector search to retrieve data relevant to the user input from the vector database. The retrieved data, along with the question, is then augmented with the previously declared prompt before being sent to the LLM for processing.
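The retrieval and augmentation step could be sketched as follows. The `chroma_db` directory and the prompt format in the demo lambda are assumptions; `similarity_search` is the standard LangChain vector-store search call.

```python
# bot.py (continued) -- search the vector DB, then augment the prompt.
import os


def retrieve_context(vectordb, question, k=4):
    """Vector-search the database and join the top-k matches into one block."""
    docs = vectordb.similarity_search(question, k=k)
    return "\n\n".join(doc.page_content for doc in docs)


def answer(question, vectordb, generate, prompt_fn, k=4):
    """Augment the question with retrieved context and send it to the LLM.

    `generate` maps a prompt string to a response; `prompt_fn` combines
    context and question into the final prompt.
    """
    context = retrieve_context(vectordb, question, k=k)
    return generate(prompt_fn(context, question))


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_community.vectorstores import Chroma

    vectordb = Chroma(persist_directory="chroma_db",
                      embedding_function=OpenAIEmbeddings())
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    print(answer("How do I reset my password?", vectordb,
                 lambda p: llm.invoke(p).content,
                 lambda c, q: f"Context:\n{c}\n\nQuestion: {q}"))
```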
We will build the form for the front end, which captures user input and returns processed responses, using Streamlit. The following code can be added to the end of the bot.py file.
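A minimal version of the front end might look like this sketch; the widget labels are assumptions, and `answer_question` stands in for whatever function wraps the RAG logic above. Keeping the UI in a small function makes it easy to exercise without a running Streamlit server.

```python
# bot.py (continued) -- a minimal Streamlit form; a sketch, not the exact
# code in the repository.

def render(st, answer_fn):
    """Draw the question form and display the model's response.

    `st` is the streamlit module; `answer_fn` maps a question string to a
    response (e.g. the RAG logic defined earlier in bot.py).
    """
    st.title("Helpdesk chatbot")
    question = st.text_input("Ask a question about your support tickets")
    if question:
        st.write(answer_fn(question))


# At the end of bot.py, wire it up with:
#   import streamlit as st
#   render(st, answer_question)  # answer_question wraps the RAG logic above
```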
4. Running and tuning the app
Run the Streamlit app from the command line with:
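```shell
streamlit run bot.py
```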
As you run the Streamlit app and supply it with prompts, it may look like the following:
With Fivetran, your data is automatically refreshed in your data lake or warehouse, and it is a simple matter of calling load_vec_db.py to refresh your vector DB. As you further enrich the vector database with more data, your answers may become more comprehensive and accurate:
This is just the beginning
This retrieval model is only a relatively simple taste of what is possible using Fivetran to integrate text-rich data from SaaS applications and databases. Chatbots are only one possible application of this technology, and there are many ways to refine the example we have provided.
Furthermore, there is a lot of potential for aspects of this process to be further streamlined, modularized or otherwise improved. For instance, the SQL process for extracting and loading may be abstracted out and turned into a library, dbt model, stored procedure or even a software feature with a GUI. Future architectures may combine the capabilities of a data repository and vector database on the same platform, altogether obviating the need for a separate pipeline. There is very likely a technical solution to incrementally updating vector databases. Retrieval from vector databases can be filtered and optimized by topic or context. Models may benefit greatly from incorporating knowledge graphs that encode hard facts, not just statistical word associations.
Stay tuned for future developments – in this fast-growing field, more and more products and tools that further simplify architectures and workflows are coming down the pike, taking the grunt work out of generative AI and allowing easier bootstrapping of innovative AI products.
[CTA_MODULE]