Best practices for optimizing token consumption for AI chatbots using retrieval-augmented generation (RAG)
The following image shows a single day's token consumption by OpenAI's GPT-4o for a chatbot that Easie configured to answer questions using retrieval-augmented generation (RAG), covering 38 conversations and 459 messages.
RAG uses embeddings-based search to convert user inputs into dense vector representations, then compares them via semantic search against a larger knowledge base stored in a vector database (e.g. Pinecone). The search results are ranked by relevance using cosine similarity (a score between -1 and 1, though typically between 0 and 1 for text embeddings, with higher values indicating greater similarity).
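To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking in Python. The random vectors stand in for real embedding output, and the fact names are hypothetical; they are not taken from the chatbot described above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# query_vector and knowledge_vectors would come from an embeddings model;
# random values are used here only to keep the sketch self-contained.
rng = np.random.default_rng(0)
query_vector = rng.normal(size=1536)
knowledge_vectors = {f"fact-{i}": rng.normal(size=1536) for i in range(5)}

# Rank knowledge items by similarity to the query, highest first.
ranked = sorted(
    ((cosine_similarity(query_vector, v), name) for name, v in knowledge_vectors.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```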
The returned knowledge is automatically inserted into the input prompt along with the original user question. The following diagram illustrates how RAG works:
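In code, the prompt-assembly step might look roughly like the sketch below; the helper name and the system prompt wording are illustrative assumptions, not Easie's actual configuration.

```python
def build_rag_prompt(user_question: str, retrieved_facts: list[str]) -> list[dict]:
    """Insert retrieved knowledge into the input alongside the user question."""
    knowledge = "\n".join(f"- {fact}" for fact in retrieved_facts)
    system_prompt = (
        "Answer the user's question using only the knowledge below.\n"
        f"Knowledge:\n{knowledge}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

# The resulting messages list can be passed to a chat-completions style API call.
messages = build_rag_prompt(
    "What are your support hours?",
    ["Support is available 9am-5pm ET, Monday through Friday."],
)
```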
Observations around token consumption & optimization
Here are some interesting observations on token consumption and optimization for chatbots built on large language models (LLMs) and large multimodal models (LMMs):
• The only way GPT-4o can "learn" things is through either its model weights (i.e., fine-tuning the model on a training set) or its model inputs (i.e., inserting the knowledge into an input message, which is what RAG does).
• This chatbot was configured to return up to ten results from the vector database with cosine similarity greater than 0.6 and insert them into the user prompt (a sketch of this configuration follows this list).
• Without a strong system prompt telling the model what to do when no relevant knowledge is found, you have a higher chance of hallucinations.
• 97% of the token consumption in this case is context tokens, meaning the input that was used to prompt the model.
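As a rough illustration of the second and third bullets, the sketch below filters vector-database matches by a 0.6 cosine-similarity threshold, caps them at ten, and falls back to an explicit "I don't know" instruction when nothing qualifies. The function and variable names are hypothetical, not Easie's implementation.

```python
TOP_K = 10                  # up to ten results inserted into the prompt
SIMILARITY_THRESHOLD = 0.6  # minimum cosine similarity to keep a match

def build_system_prompt(search_results: list[tuple[float, str]]) -> str:
    """Build a system prompt from scored (cosine_similarity, text) matches."""
    kept = [text for score, text in sorted(search_results, reverse=True)
            if score > SIMILARITY_THRESHOLD][:TOP_K]
    if not kept:
        # Without an explicit fallback like this, the model is more likely to
        # hallucinate an answer when retrieval comes back empty.
        return ("No relevant knowledge was found. Tell the user you do not "
                "know the answer and suggest contacting support.")
    knowledge = "\n".join(f"- {text}" for text in kept)
    return f"Answer using only the knowledge below.\nKnowledge:\n{knowledge}"
```

The resulting system prompt can be combined with the user's question in the same messages structure shown earlier.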
Even though GPT-4o input tokens cost one-third as much as its output tokens and 50% less than GPT-4 Turbo's input tokens, token costs can add up significantly:
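For a sense of scale, here is a back-of-the-envelope calculation using GPT-4o's launch pricing of $5 per 1M input tokens and $15 per 1M output tokens. The daily token counts are illustrative assumptions chosen to reflect a roughly 97% input-token split, not the actual figures from this chatbot.

```python
INPUT_PRICE_PER_M = 5.00    # USD per 1M input (context) tokens, GPT-4o launch pricing
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens, GPT-4o launch pricing

daily_input_tokens = 970_000   # hypothetical daily context tokens
daily_output_tokens = 30_000   # hypothetical daily output tokens

daily_cost = (daily_input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
           + (daily_output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
print(f"Estimated daily cost: ${daily_cost:.2f}")  # ~$5.30/day, ~$160/month
```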
Actionable advice to anyone using RAG for chatbots
Given all this, one might wonder what the ideal configuration is to optimize for accuracy while minimizing token consumption. Here is some actionable advice for anyone using RAG for chatbots:
• Itemize your knowledge base so that each fact is as individualized and specific as possible. We have seen higher token consumption and more hallucinations with knowledge bases where many facts are chunked together into single sections.
• Understanding your use case will help you identify how to structure the corpus of data for the knowledge base (e.g. question-answer concatenation versus text search over a large set of documents).
• Decide how many retrieved responses to insert into your input from the embeddings-based search query. If you have itemized your knowledge into smaller, highly specific chunks and use a higher cosine similarity threshold, you will reduce your input token consumption (see the sketch after this list).
• Reducing the number of returned responses inserted into the input prompt increases the chance that the chatbot will not know the correct answer, and potentially the chance of hallucinations. However, it can still be good practice for optimization, since too many returned responses may be excessive.
• Determine what a good cosine similarity threshold is for your specific use case and experiment with higher values; going below 0.6 is not recommended.
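One way to run these experiments is to measure how many input tokens each retrieval configuration adds. Below is a minimal sketch assuming a hypothetical list of scored facts from the vector database, using tiktoken's o200k_base encoding (the tokenizer used by GPT-4o).

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def context_tokens(scored_facts: list[tuple[float, str]],
                   top_k: int, threshold: float) -> int:
    """Count the input tokens added by a given top_k / threshold setting."""
    kept = [text for score, text in sorted(scored_facts, reverse=True)
            if score > threshold][:top_k]
    return len(enc.encode("\n".join(kept)))

# Hypothetical (cosine_similarity, text) pairs returned by the vector database.
scored_facts = [(0.82, "Fact A..."), (0.71, "Fact B..."),
                (0.64, "Fact C..."), (0.58, "Fact D...")]

for top_k, threshold in [(10, 0.6), (5, 0.6), (5, 0.7)]:
    print(f"top_k={top_k}, threshold={threshold}: "
          f"{context_tokens(scored_facts, top_k, threshold)} context tokens")
```

Comparing these counts against answer accuracy on a test set of questions makes the trade-off between token consumption and retrieval coverage explicit.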
The key takeaway for anyone building chatbots: how you configure a RAG-based AI chatbot can have major effects on your token usage and output quality/accuracy.
Want Easie to help with your next AI project? Get in touch with our team to learn more about how we can help.