Building a chatbot with Fivetran and LangChain

Build your own generative AI chatbot! This post is a continuation of our earlier blog post, “Assembling a RAG architecture using Fivetran.”

Suppose you have some text from an application that you want to use to build an automated helpdesk chatbot. This is readily accomplished with a text-rich data source, a data integration solution and a scripting language. We have put together an example based on Zendesk data hosted in an S3 data lake. It is written in Python, uses OpenAI's GPT-4 as the foundation model and ChromaDB as the vector database, and makes extensive use of the LangChain API to connect these pieces together.

Following the RAG architecture described in that post, the first leg of this workflow is quite simple: set up a connector from your source – in this case, Zendesk – to your destination using Fivetran.

As discussed in this blog post, getting started with Fivetran takes no more than a few steps:

  1. Sign up for Fivetran
  2. Set up your Zendesk connector using your credentials. Fivetran connects more than 500 common SaaS, database, event streaming and file storage platforms
  3. Set up and assign a destination, either a data lake or data warehouse 
  4. Set a schedule and begin an initial sync

The next legs of the process involve some engineering work. You will need to accomplish the following steps:

  1. Extract, transform and vectorize the relevant data from your data repository, loading it to the vector database. You can do this once, persist the results and perform the sync again after enough of the underlying data has changed.
  2. Set up a user interface that can accept prompts and search for the relevant context from the vector database. Some notable examples of vector databases include ChromaDB, Pinecone and Weaviate.
  3. Have the retrieval model send the augmented prompts to the foundation model and generate a response.

After completing this tutorial, you will have built a retrieval model consisting of the following files. The complete implementation of the project is available in this GitHub repository.

zendesk_qnabot
|-- bot.py
|-- load_vec_db.py
|-- query.sql
|-- requirements.txt

The Python scripts sync the data to a vector database and contain all the working parts of the retrieval model, while the SQL script instructs the Athena query engine to extract and transform data from the data lake. 

1. Prerequisites

To build the retrieval model, let’s start by setting up your development environment.

1. From the command line, create and/or navigate to the directory where you want to build the project

mkdir <path>
cd <path>

2. Make sure you have Python version 3.9 or above installed


python --version

3. Set up a Python virtual environment for this project and activate it


virtualenv venv
source venv/bin/activate

4. To install the necessary libraries, create a file named requirements.txt and add the following requirements to this file.


boto3
chromadb
langchain
langchain-community
langchain-openai
openai
streamlit
tiktoken

After this, run the pip command below to install these libraries. Ensure that you are still inside the Python virtual environment while doing so.


pip install -r requirements.txt

We can now move on to implementing the retrieval model. We will rely heavily on libraries from LangChain. Besides accessing the foundation model, the LangChain API abstracts away chores such as extracting data from a data lake using a query engine, instantiating and accessing a vector database and properly formatting the inputs and outputs for the foundation model.

2. Extracting and transforming the data

In the root of the project directory, create a file named query.sql. This file will contain the SQL query that extracts raw Zendesk data from your data lake and transforms the one-to-many relationships between tickets and comments into one-to-one relationships. Flattening these relationships is a recurring chore for structured chat data models that store large blocks of text.

We begin with a common table expression (CTE) starting with "WITH CommentRank AS (...":


WITH CommentRank AS (
    SELECT
        tc.ticket_id,
        tc.created AS comment_created,
        FORMAT(
            'Ticket Comment %d\nCreated On : %s\nCreated By : %s (%s), Role: %s\n%s\n%s\n%s\n',
            ROW_NUMBER() OVER (PARTITION BY tc.ticket_id ORDER BY tc.created),
            CAST(tc.created AS VARCHAR),
            u.name,
            u.email,
            u.role,
            '-------------------------',
            tc.body,
            '-------------------------'
        ) AS formatted_comment
    FROM
        TICKET_COMMENT tc
    JOIN
        USER u ON tc.user_id = u.id
)

The window function “ROW_NUMBER() OVER(PARTITION BY…” collects and sorts all comments on a ticket from oldest to newest, assigning a ranking. Our goal here is to massage a one-ticket-to-many-comments relationship to make the comments easier to combine.

This CTE also concatenates each comment with associated fields that contain important details and context, like the names of users, enabling you to ask the chatbot questions about specific participants.
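Filled in with hypothetical values, each formatted comment looks roughly like this:

Ticket Comment 1
Created On : 2024-03-01 10:15:42.000
Created By : Jane Doe (jane.doe@example.com), Role: end-user
-------------------------
Our nightly sync to the data lake failed again. Can someone take a look?
-------------------------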

The subsequent SELECT statement takes the formatted comments from our CTE and uses the LISTAGG ... WITHIN GROUP aggregate function to combine these blocks of text by ticket ID. The result of this query is a single textual record per ticket.


SELECT
    t.id AS ticket_id,
    t.subject AS ticket_subject,
    t.created_at AS ticket_created_at,
    FORMAT(
        '\nTicket ID : %d\nCreated On : %s\nSubject : %s\nStatus : %s\nPriority : %s\n\n%s',
        t.id,
        CAST(t.created_at AS VARCHAR),
        t.subject,
        t.status,
        t.priority,
        LISTAGG(cr.formatted_comment, '\n') WITHIN GROUP (ORDER BY cr.comment_created)
    ) AS ticket_details
FROM
    TICKET t
LEFT JOIN
    CommentRank cr ON t.id = cr.ticket_id
GROUP BY
    t.id, t.subject, t.created_at, t.status, t.priority

To illustrate what the query accomplishes, here is a ticket_details record filled in with hypothetical values:
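Ticket ID : 4127
Created On : 2024-02-28 16:03:11.000
Subject : Nightly sync failures
Status : open
Priority : high

Ticket Comment 1
Created On : 2024-03-01 10:15:42.000
Created By : Jane Doe (jane.doe@example.com), Role: end-user
-------------------------
Our nightly sync to the data lake failed again. Can someone take a look?
-------------------------
Ticket Comment 2
Created On : 2024-03-01 10:42:05.000
Created By : Sam Agent (sam.agent@example.com), Role: agent
-------------------------
Thanks for flagging this. We are investigating the failed loads now.
-------------------------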

3. Building the retrieval model


1. Loading the vector DB

The first step in building our retrieval model is to extract the transformed Zendesk data from the data lake, vectorize it and then load it into a vector database. To accomplish this, create a file named load_vec_db.py with the following contents:


import os

from langchain_community.document_loaders.athena import AthenaLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

os.environ['OPENAI_API_KEY'] = 'sk-****'

# Read the extraction and transformation query from disk
with open('query.sql', 'r') as q:
    query = q.read()

# Run the query through the Athena query engine and load the results
# as LangChain documents, keeping a few columns as metadata
docs = AthenaLoader(
    query=query,
    database="zendesk",
    s3_output_uri="s3://asimov-datalake-s3/query_results/",
    profile_name="datasharing",
    metadata_columns=["ticket_id", "ticket_subject", "ticket_created_at"]
).load()

# Vectorize the documents and persist them to a local Chroma database
embeddings = OpenAIEmbeddings()
vecdb = Chroma.from_documents(docs, embeddings, persist_directory="./db")
vecdb.persist()


You can then execute this script by running it from the command line:


python load_vec_db.py

After the first load, you can rerun the script whenever the data in the data lake has changed enough to warrant an update. Note that this ETL is not incremental and may take some time to execute.
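If you would rather refresh on a schedule than on demand, one simple option (an assumption on our part, not part of the project files) is a cron job that reruns the loader, for example weekly:

0 2 * * 0 cd /path/to/zendesk_qnabot && ./venv/bin/python load_vec_db.py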

2. Implementing RAG

After your Zendesk data has been loaded into the vector database, you can search for information within this database and perform RAG operations on your data. Create a file named bot.py, where you will implement these operations. First, declare a foundation model and create a prompt that this model can understand.


from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

template = """
You are an assistant tasked with answering questions based on Zendesk chat threads. Refer to the following chat threads to provide your response. Keep your answer concise, using a maximum of three sentences.
Question: {question}
Chat Threads: {chat_threads}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4-0125-preview")

Once the prompt is declared, implement the vector search that retrieves data relevant to the user's question from the vector database. The retrieved data, along with the question, is combined with the previously declared prompt before being sent to the LLM for processing.


from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

def get_answer(question):
    # Reopen the vector database persisted by load_vec_db.py
    db = Chroma(persist_directory="./db", embedding_function=OpenAIEmbeddings())
    # Retrieve relevant chat threads and pass the question through unchanged
    inputs = {"chat_threads": db.as_retriever(), "question": RunnablePassthrough()}
    rag_chain = inputs | prompt | llm | StrOutputParser()
    return rag_chain.invoke(question)
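Before wiring up a front end, you can smoke-test the chain from a Python shell in the project directory. This is a minimal check, assuming load_vec_db.py has already populated ./db and OPENAI_API_KEY is set in your environment:

from bot import get_answer

# Sample question; anything covered by your Zendesk tickets works here
print(get_answer("What is the biggest issue with datalakes?"))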

We will use Streamlit to build the front-end form that captures user input and displays the generated responses. Add the following code to the end of the bot.py file.


import streamlit as st

st.title("Zendesk - Q&A Bot")

with st.form("my_form"):
    sample_question = "What is the biggest issue with datalakes?"
    question = st.text_area("Enter text:", sample_question)
    submitted = st.form_submit_button("Submit")
    if submitted:
        # Only call the retrieval chain once the user submits the form
        answer = get_answer(question)
        st.info(answer)

4. Running and tuning the app

Run the Streamlit app from the command line with:


streamlit run bot.py 

Once the app is running, enter a question into the form and press Submit; the bot's answer appears in an info box beneath the form.

With Fivetran, your data is automatically refreshed in your data lake or warehouse, and refreshing your vector DB is a simple matter of rerunning load_vec_db.py. As you further enrich the vector database with more data, your answers may become more comprehensive and accurate.

This is just the beginning

This retrieval model is only a relatively simple taste of what is possible using Fivetran to integrate text-rich data from SaaS applications and databases. Chatbots are only one possible application of this technology, and there are many ways to refine the example we have provided.

Furthermore, there is a lot of potential for aspects of this process to be further streamlined, modularized or otherwise improved. For instance, the SQL process for extracting and loading may be abstracted out and turned into a library, dbt model, stored procedure or even a software feature with a GUI. Future architectures may combine the capabilities of a data repository and vector database on the same platform, altogether obviating the need for a separate pipeline. There is very likely a technical solution to incrementally updating vector databases. Retrieval from vector databases can be filtered and optimized by topic or context. Models may benefit greatly from incorporating knowledge graphs that encode hard facts, not just statistical word associations.

Stay tuned for future developments – in this fast-growing field, more and more products and tools that further simplify architectures and workflows are coming down the pike, taking the grunt work out of generative AI and allowing easier bootstrapping of innovative AI products.
