Best practices for optimizing token consumption for AI chatbots using retrieval-augmented generation (RAG)

The following image shows a single day's token consumption on OpenAI's GPT-4o for a chatbot Easie configured to answer questions using retrieval-augmented generation (RAG), spanning 38 conversations and 459 messages.

Figure 1.1 - OpenAI token consumption on GPT-4o for an AI chatbot

RAG uses embeddings-based search: user inputs are converted into dense vector representations and compared, via semantic search, against a larger knowledge base stored in a vector database (e.g. Pinecone). The search results are then ranked by relevance using cosine similarity (a score that, for typical text embeddings, falls between 0 and 1, where higher means more similar).
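As a rough illustration of that retrieval step, here is a minimal sketch in Python, assuming OpenAI's embeddings API and a small in-memory knowledge base in place of a hosted vector database (the model name, the example facts, and the search helper are illustrative assumptions, not Easie's production setup):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(text: str) -> np.ndarray:
        """Convert text into a dense vector using an OpenAI embedding model."""
        response = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(response.data[0].embedding)

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Relevance score; values closer to 1 mean a closer semantic match."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy in-memory knowledge base; in production this lives in a vector database
    # such as Pinecone, which performs the similarity search server-side.
    facts = [
        "Our office is open Monday through Friday, 9am to 5pm.",
        "Support tickets are answered within one business day.",
    ]
    fact_vectors = [embed(f) for f in facts]

    def search(query: str, top_k: int = 10, threshold: float = 0.6) -> list[tuple[float, str]]:
        """Rank facts by cosine similarity, keeping the top matches above the threshold."""
        query_vector = embed(query)
        scored = sorted(
            ((cosine_similarity(query_vector, v), f) for v, f in zip(fact_vectors, facts)),
            reverse=True,
        )
        return [(score, fact) for score, fact in scored[:top_k] if score >= threshold]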

The returned knowledge is inserted automatically into the input prompt along with the original user question. The following diagram depicts more on how RAG works:

 

Figure 1.2 - RAG process flow diagram (1)
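In code, that insertion step can be as simple as the following sketch (the "Retrieved knowledge" label and the overall formatting are assumptions, not the production prompt template):

    def build_prompt(question: str, retrieved_facts: list[str]) -> str:
        """Insert the retrieved knowledge into the input prompt alongside the user's question."""
        knowledge = "\n".join(f"- {fact}" for fact in retrieved_facts)
        return (
            "Answer the question using the retrieved knowledge below.\n\n"
            f"Retrieved knowledge:\n{knowledge}\n\n"
            f"Question: {question}"
        )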

 

Observations around token consumption & optimization

Here are some interesting observations around token consumption and optimization for chatbots built on large language models (LLMs) and large multimodal models (LMMs):

• The only ways GPT-4o can "learn" things are through model weights (i.e., fine-tuning the model on a training set) or through model inputs (i.e., inserting the knowledge into an input message, which is what RAG does).

• This chatbot was configured to return up to ten results from the vector database with cosine similarity greater than 0.6 and insert them into the user prompt (see the sketch after this list).

• Without a strong system prompt in place specifying what the model should do if no knowledge is found, you have a higher chance of hallucinations.

• 97% of the token consumption in this case is context tokens, meaning the input that was used to prompt the model.
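Putting the pieces together, a sketch of this configuration might look like the following (the system prompt wording is an assumption, and search and build_prompt are the illustrative helpers from the earlier sketches):

    from openai import OpenAI

    client = OpenAI()

    # Telling the model explicitly what to do when nothing relevant is retrieved
    # is what guards against hallucinated answers.
    SYSTEM_PROMPT = (
        "Answer only from the retrieved knowledge provided in the user message. "
        "If no relevant knowledge is provided, say you don't know instead of guessing."
    )

    def answer(question: str) -> str:
        # Up to ten results with cosine similarity above 0.6, matching the configuration above.
        results = search(question, top_k=10, threshold=0.6)
        retrieved_facts = [fact for _, fact in results] or ["(no relevant knowledge found)"]

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": build_prompt(question, retrieved_facts)},
            ],
        )
        return response.choices[0].message.content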

Despite GPT-4o input tokens costing one third as much as its output tokens and 50% less than GPT-4 Turbo's input tokens, token costs can still add up significantly:

Figure 1.3 - Cost comparison between GPT-4o and GPT-4 Turbo (2)
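As a back-of-the-envelope sketch of how input-heavy usage translates into cost, using GPT-4o's and GPT-4 Turbo's launch-time list prices per million tokens and a hypothetical day of one million total tokens (verify current pricing before relying on these numbers):

    # Illustrative USD prices per million tokens (launch-time list prices; verify current rates).
    GPT_4O_INPUT, GPT_4O_OUTPUT = 5.00, 15.00            # input is 1/3 the cost of output
    GPT_4_TURBO_INPUT, GPT_4_TURBO_OUTPUT = 10.00, 30.00

    def cost(input_tokens: int, output_tokens: int, input_price: float, output_price: float) -> float:
        """Total cost in USD for one day of usage."""
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

    # Hypothetical day where 97% of one million total tokens are context (input) tokens.
    input_tokens, output_tokens = 970_000, 30_000
    print(f"GPT-4o:      ${cost(input_tokens, output_tokens, GPT_4O_INPUT, GPT_4O_OUTPUT):.2f}")            # $5.30
    print(f"GPT-4 Turbo: ${cost(input_tokens, output_tokens, GPT_4_TURBO_INPUT, GPT_4_TURBO_OUTPUT):.2f}")  # $10.60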

Actionable advice for anyone using RAG for chatbots

Given all this, one might wonder what the ideal configuration is for optimizing accuracy while minimizing token consumption. Here is some actionable advice for anyone using RAG for chatbots:

• Itemize your knowledge base so that each fact is as individualized and specific as possible (see the sketch after this list). We have seen higher token consumption and more hallucinations with knowledge bases where many facts are chunked together into single sections.

• Understanding your use case will help you identify how to structure the corpus of data for the knowledge base (e.g. question-answer concatenation versus text search over a large set of documents).

• Decide how many retrieved results from the embeddings-based search to insert into your input. If you have itemized your knowledge into smaller, highly specific entries and use a higher cosine similarity threshold, you will reduce your input token consumption.

• Reducing the number of retrieved results inserted into the input prompt raises the chance that the chatbot will lack the knowledge needed for the correct answer, and potentially raises the chance of hallucinations. However, it can still be good practice for optimization, since too many returned results is often excessive.

• Determine what a good cosine similarity threshold is for your specific use case and experiment with higher values. Some trial and error may be needed, but we do not suggest going below 0.6.
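To make the itemization point concrete, here is a small sketch comparing the input tokens consumed by a chunked section versus a single itemized entry (the FAQ content is hypothetical, and the sketch assumes a recent tiktoken release that ships GPT-4o's o200k_base encoding):

    import tiktoken

    # o200k_base is the token encoding used by GPT-4o.
    encoding = tiktoken.get_encoding("o200k_base")

    def count_tokens(text: str) -> int:
        return len(encoding.encode(text))

    # One large chunk that lumps several unrelated facts together.
    chunked_section = (
        "Q: What are your hours? A: 9am to 5pm, Monday to Friday. "
        "Q: How do I reset my password? A: Use the 'Forgot password' link on the login page. "
        "Q: Do you offer refunds? A: Yes, within 30 days of purchase."
    )

    # The same knowledge itemized into one specific entry per fact.
    itemized_entries = [
        "Q: What are your hours? A: 9am to 5pm, Monday to Friday.",
        "Q: How do I reset my password? A: Use the 'Forgot password' link on the login page.",
        "Q: Do you offer refunds? A: Yes, within 30 days of purchase.",
    ]

    # A question about refunds retrieves only the matching itemized entry,
    # while a chunked knowledge base drags the whole section into the prompt.
    print("chunked section:", count_tokens(chunked_section), "tokens")
    print("itemized match: ", count_tokens(itemized_entries[2]), "tokens")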


The key takeaway for anyone building chatbots is that variations in how you configure an AI chatbot using RAG can have major effects on your token usage and output quality/accuracy.

Want Easie to help with your next AI project? Get in touch with our team to learn more about how we can help.

