Demo
Problem Statement
Listing/reservation businesses like Yelp offer value to users by providing useful information that helps them decide where to go next. Good search and recommendation systems go a long way, but they are still far from delivering the ultimate experience, where users can interact naturally with the system for complex queries or hold a conversation to drill down into their needs.
Approach
Build a chatbot that helps users discover places to go and make reservations.
Workflow:
- Download the Yelp reviews dataset. Sample 5,240 reviews from 100 businesses.
- Set up the development environment, including experiment tracking with MLflow and observability with Arize Phoenix.
- Build an MVP version using LlamaIndex and Qdrant.
- Build synthetic evaluation datasets with 30 questions for retrieval and response. A manual response dataset is gradually built up and extended based on error analysis.
- Conduct error analysis on the model's outputs to come up with ideas for new iterations. Run a total of 10 experiments to improve the RAG pipeline, with notable attempts including: replacing Llama-3.1-8B with GPT-4o-mini, fine-tuning the embedding model, hybrid retrievers, semantic chunking, a BGE reranker, and query expansion.
- Build a RAG agent on the OpenAI API with a Query Engine tool and a Reservation Service tool (a minimal sketch follows this list). The chatbot UI is built with Chainlit.
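For illustration, here is a minimal sketch of that wiring, assuming LlamaIndex with a Qdrant vector store and the OpenAI agent. The collection name, tool names, metadata fields, and the reservation stub are illustrative, not the project's actual code, and `OPENAI_API_KEY` is assumed to be set for both embeddings and the agent LLM.

```python
import qdrant_client
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Toy stand-in for the 5,240 sampled reviews.
reviews = [
    {"business_id": "b1", "text": "Amazing tonkotsu ramen, cozy atmosphere, friendly staff."},
    {"business_id": "b2", "text": "Great tacos but the wait on weekends is brutal."},
]

# Index the reviews into a local Qdrant collection.
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="yelp_reviews")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
documents = [
    Document(text=r["text"], metadata={"biz_id": r["business_id"]}) for r in reviews
]
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

def make_reservation(business_name: str, time: str, party_size: int) -> str:
    """Book a table at a business (stub standing in for the real reservation service)."""
    return f"Reserved a table for {party_size} at {business_name} ({time})."

# The agent gets the index as a query tool plus the reservation stub.
tools = [
    QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(similarity_top_k=10),
        name="review_search",
        description="Search Yelp reviews to find and recommend places to go.",
    ),
    FunctionTool.from_defaults(fn=make_reservation),
]
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Find me a cozy ramen spot and book a table for two at 7pm."))
```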
Evaluation
Evaluation results can be found here.
Two proposed key metrics are Retrieval Hit Rate and Response Correctness.
Retrieval is a critical component of any RAG system. Along with data prep, retrieval sits at the top of the pipeline, so improvements on these fronts are more likely to lift the overall system. Hit rate is chosen as the key metric because reranking can be employed as a subsequent step, which leaves room to fix ranking issues later.
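To make the metric concrete, a minimal hit-rate computation might look like this; the function name and data shapes are mine, not the project's evaluation code.

```python
def hit_rate_at_k(retrieved_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold document id appears in the top-k retrieved ids."""
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)

# Example: 2 of 3 queries find their gold id in the top-2 results -> ~0.67
print(hit_rate_at_k([["a", "b"], ["c", "d"], ["e", "f"]], ["b", "x", "e"], k=2))
```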
For response, Correctness measures both how relevant the answer is to the query and how correct it is relative to the reference answer. It is therefore a better indicator than pure relevance, which is judged against the query alone and hence easier to get right.
For reference, Response Correctness on the synthetic dataset has improved by 166%, from 1.75 / 5.00 in the MVP version to 4.67 / 5.00 in the current version. The current Retrieval Hit Rate @ 50 reaches 73%; this is not directly comparable, but at the MVP stage Retrieval Hit Rate @ 10 was 20%.
As next steps, while there is not much room left to improve Response Correctness, we aim to raise Retrieval Hit Rate to 90%, which should be achievable given that this dataset contains only a small amount of data.
Learnings/Remarks
- Using question-style queries leads to a 5-20% uplift in retrieval hit rate compared to keyword-style queries
- The BM25 Retriever alone results in a 200% increase in retrieval effectiveness across hit rate, average precision, MRR, and NDCG (a hybrid setup combining it with the vector retriever is sketched after this list)
- Fine-tuning a small embedding model like Snowflake/snowflake-arctic-embed-m-v1.5 yields a +80% gain in retrieval effectiveness, especially in the ranking of the retrieved nodes
- Using GPT-4o-mini as the response synthesizer significantly improves response quality in all aspects (especially correctness, from 2.6 to 3.8) compared to Llama 3.1-8B-Instruct
- Using TreeSummarize with a custom prompt yields a +10% uplift in response correctness, from 3.97 to 4.37. Eyeballing the outputs, we also see noticeably better, recommendation-like responses (sketched after this list)
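For reference, a hybrid setup fusing BM25 with the vector retriever might look like the sketch below in LlamaIndex. It assumes the `documents` and `index` from the MVP sketch above, and the top-k values are illustrative.

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# BM25 needs the raw nodes; re-parse them from the documents.
nodes = SentenceSplitter().get_nodes_from_documents(documents)

vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

# Fuse the dense and sparse result lists with reciprocal rank fusion.
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # keep automatic query expansion off for this sketch
    mode="reciprocal_rerank",
)
results = hybrid_retriever.retrieve("casual brunch spot with great coffee")
```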
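And the TreeSummarize setup with a custom prompt could be sketched as follows; the prompt text here is illustrative, not the exact one used in the experiments.

```python
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import TreeSummarize

# Steer the synthesizer toward recommendation-style answers.
recommendation_prompt = PromptTemplate(
    "You are a knowledgeable local guide. Using only the reviews below, "
    "recommend places that answer the question, with one short reason each.\n"
    "Reviews:\n{context_str}\n"
    "Question: {query_str}\n"
    "Recommendation:"
)
synthesizer = TreeSummarize(summary_template=recommendation_prompt)
query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever, response_synthesizer=synthesizer
)
response = query_engine.query("Where should I take a date for a quiet dinner?")
```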
Challenges
Challenge 1: Auto-retrieval is not reliable
While in theory both precision and recall should improve greatly if we can apply the right filters to user questions instead of relying on embedding/keyword matching, my first attempt at auto-retrieval with ChromaDB did not yield promising results. There were at least two syntactic issues that broke the agentic workflow. Even after fixing those, the approach remained unreliable, and I also observed a 10% degradation in Retrieval Hit Rate.
In the end I abandoned the feature, but I still look forward to finding a way to re-apply this technique.
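For context, a typical auto-retrieval setup looks like the sketch below, shown here with LlamaIndex's VectorIndexAutoRetriever rather than the exact ChromaDB wiring I used; the metadata schema is illustrative. An LLM reads the metadata descriptions and infers structured filters from the natural-language query.

```python
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# Describe the metadata so the LLM can translate queries into filters.
vector_store_info = VectorStoreInfo(
    content_info="Yelp reviews of local businesses",
    metadata_info=[
        MetadataInfo(name="biz_id", type="str", description="Business identifier"),
        MetadataInfo(name="city", type="str", description="City where the business is located"),
        MetadataInfo(name="stars", type="int", description="Review star rating from 1 to 5"),
    ],
)
auto_retriever = VectorIndexAutoRetriever(
    index, vector_store_info=vector_store_info, similarity_top_k=10
)
results = auto_retriever.retrieve("top-rated sushi restaurants in Santa Barbara")
```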
Challenge 2: The indexing pipeline takes too long
Indexing 70K nodes from 30K reviews for 400 businesses takes more than 6 hours!
Future Improvements
- Guardrail system inputs and outputs
- Experiment with contextual compression and filters
- Fine-tune the LLM re-ranker (FlagEmbedding BGE Reranker)
- Try ColBERT as a new retriever (maybe add it to the list of retrievers)
- Try different loss functions when training embeddings
- Improve diversity by implementing a custom re-ranker that down-weights reviews from already seen biz_ids (a minimal sketch follows this list)
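A minimal sketch of that diversity re-ranker idea, assuming each retrieved node carries a `biz_id` in its metadata; the geometric penalty factor is an arbitrary choice.

```python
from collections import defaultdict
from llama_index.core.schema import NodeWithScore

def diversity_rerank(nodes: list[NodeWithScore], penalty: float = 0.7) -> list[NodeWithScore]:
    """Down-weight each additional node from a biz_id that has already appeared."""
    seen: defaultdict[str, int] = defaultdict(int)
    for node in nodes:
        biz_id = node.node.metadata.get("biz_id")
        node.score = (node.score or 0.0) * (penalty ** seen[biz_id])
        seen[biz_id] += 1
    return sorted(nodes, key=lambda n: n.score or 0.0, reverse=True)
```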
If you find this article helpful, please cite this writeup as:
Quy, Dinh. (Sep 2024). Building a Conversational Assistant for Restaurant Discovery and Booking. dvquys.com. https://dvquys.com/projects/review-rec-bot/.