Overview
A production-grade Arabic-language legal Q&A chatbot built to make Libyan legal documents accessible through natural language. The system ingests structured legal corpora, retrieves relevant articles, and generates grounded answers with citations.
The problem
Libyan legal documents exist as dense Arabic text with no reliable way to query them. Arabic morphology makes naive keyword search unreliable, and hallucination is unacceptable in legal contexts.
Solution architecture
- Query reformulation (LLM) to improve retrieval recall and handle follow-ups
- Hybrid retrieval: embeddings + Arabic-aware BM25 (Farasa-tokenized)
- Grounded answer generation with explicit no-answer handling
- Semantic caching (PostgreSQL) to reduce latency and cost
- Multi-turn conversational mode with streaming output
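
The write-up does not say how the embedding and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the project's actual method. The function name, the document ids, and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical article ids returned by the two retrievers.
dense_hits = ["art_12", "art_7", "art_30"]
bm25_hits = ["art_7", "art_45", "art_12"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

RRF uses only rank positions, so it avoids having to calibrate BM25 scores against cosine similarities, which sit on different scales.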
Development highlights
- Replaced keyword-only retrieval with Farasa-assisted BM25 plus embeddings
- Added timing and payload logging early to debug production issues quickly
- Expanded corpus (including civil and maritime law) with a repeatable ingestion pipeline
- Implemented conversational mode, voice input, and response classification
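
The timing and payload logging mentioned above could look roughly like the following context manager. This is a minimal sketch: the logger name, the stage names, and the payload fields are assumptions, not the project's actual logging schema.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag.pipeline")  # hypothetical logger name


@contextmanager
def timed_stage(stage, **payload):
    """Log one JSON record per pipeline stage with its elapsed time."""
    start = time.perf_counter()
    record = {"stage": stage, **payload}
    try:
        # The caller may attach extra fields (e.g. hit counts) to the record.
        yield record
    finally:
        record["elapsed_ms"] = round((time.perf_counter() - start) * 1000, 2)
        # ensure_ascii=False keeps Arabic text readable in the log output.
        logger.info(json.dumps(record, ensure_ascii=False))


with timed_stage("retrieval", query_chars=24) as rec:
    rec["num_hits"] = 3  # placeholder for the real retrieval call
```

Because the record is written in the finally clause, a stage that raises still leaves a timing entry behind, which is usually when the log is needed most.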
Results
- Faster repeat queries via caching
- Higher retrieval accuracy via hybrid search + reformulation
- Grounded answers only, with guardrails for out-of-scope questions
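
The caching result depends on matching a new query to a semantically similar earlier one. The sketch below shows the lookup logic with an in-memory list. The project stores its cache in PostgreSQL, which is omitted here, and the class name, the 0.92 threshold, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory stand-in for a database-backed semantic cache."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold  # assumed value; tune on real queries
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, query_embedding):
        """Return the most similar cached answer at or above the threshold."""
        best_answer, best_score = None, self.threshold
        for embedding, answer in self.entries:
            score = cosine(query_embedding, embedding)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05])   # near-duplicate query
miss = cache.get([0.0, 1.0])    # unrelated query
```

A miss returns None, so the caller can fall through to the full retrieval and generation path and then store the new answer with put.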
Key takeaways
- Arabic NLP needs specialized preprocessing; generic tokenizers underperform.
- Reformulate before retrieve; it is one of the highest-leverage RAG improvements.
- Log everything from day one; it pays back in production.
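
To illustrate why the first takeaway matters, here is a small example of the kind of Arabic normalization that generic tokenizers skip. It is not Farasa and not the project's pipeline, only a sketch of two common steps: removing diacritics and tatweel, and unifying alef variants. The function name is hypothetical.

```python
import re

# Tashkeel (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, or hamza below, all mapped to bare alef.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text):
    """Strip diacritics and tatweel, then unify alef forms."""
    text = DIACRITICS.sub("", text)
    return ALEF_VARIANTS.sub("\u0627", text)


# The vocalized and plain spellings of the same word now match.
vocalized = "\u0623\u064e\u062d\u0652\u0643\u064e\u0627\u0645"
plain = "\u0627\u062d\u0643\u0627\u0645"
print(normalize_arabic(vocalized) == plain)
```

Without a step like this, a keyword index treats the vocalized and unvocalized spellings as different terms, which is one reason naive keyword search is unreliable on Arabic legal text.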