Arabic Legal Chatbot v3 (Case Study)

Overview

A production-grade Arabic-language legal Q&A chatbot built to make Libyan legal documents accessible through natural-language queries. The system ingests structured legal corpora, retrieves relevant articles, and generates grounded answers with citations.

The problem

Libyan legal documents exist as dense Arabic text with no reliable way to query them. Arabic morphology makes naive keyword search unreliable, and hallucination is unacceptable in legal contexts.

Solution architecture

Query reformulation (LLM) to improve retrieval recall and handle follow-ups
Hybrid retrieval: embeddings + Arabic-aware BM25 (Farasa-tokenized)
Grounded answer generation with explicit no-answer handling
Semantic caching (PostgreSQL) to reduce latency and cost
Multi-turn conversational mode with streaming output
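
The hybrid retrieval step above has to merge two ranked lists, one from embeddings and one from BM25. The case study does not say how the two are combined, so the following is only a minimal sketch that assumes reciprocal rank fusion (RRF), a common choice because it needs no score calibration between retrievers. The function name, the constant k=60, and the article IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    k is a smoothing constant; 60 is a commonly used default and is an
    assumption here, not a value taken from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical article IDs returned by each retriever.
embedding_hits = ["art_12", "art_7", "art_3"]
bm25_hits = ["art_7", "art_44", "art_12"]

fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
print(fused)
```

Articles that both retrievers return rise to the top, while articles found by only one retriever are still kept as lower-ranked candidates.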

Development highlights

Replaced keyword-only retrieval with Farasa-assisted BM25 plus embeddings
Added timing and payload logging early to debug production issues quickly
Expanded corpus (including civil and maritime law) with a repeatable ingestion pipeline
Implemented conversational mode, voice input, and response classification
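
The timing and payload logging mentioned above could look roughly like the decorator below. This is a sketch under assumptions: the logger name, the stage label, the 500-character payload cap, and the placeholder retrieve function are all hypothetical and not taken from the project.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("chatbot.timing")  # hypothetical logger name


def log_timing(stage):
    """Log how long a pipeline stage took, plus a truncated view of its inputs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on failure, so slow errors are logged too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                payload = json.dumps(
                    {"args": args, "kwargs": kwargs},
                    ensure_ascii=False,  # keep Arabic text readable in the logs
                    default=str,         # fall back to str() for non-JSON values
                )
                logger.info(
                    "stage=%s elapsed_ms=%.1f payload=%s",
                    stage, elapsed_ms, payload[:500],
                )
        return wrapper
    return decorator


@log_timing("retrieve")
def retrieve(query, top_k=5):
    # Placeholder standing in for the real retrieval call.
    return [f"doc_{i}" for i in range(top_k)]
```

Wrapping each stage (reformulation, retrieval, generation) this way gives one log line per stage per request, which is usually enough to see where latency goes.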

Results

Faster repeat queries via caching
Higher retrieval accuracy via hybrid search + reformulation
Grounded answers only, with guardrails for out-of-scope questions
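
To illustrate how caching speeds up repeat queries, here is a minimal in-memory sketch of a semantic cache lookup. The real system stores its cache in PostgreSQL; the class name, the 0.95 similarity threshold, and the toy three-dimensional vectors below are assumptions made only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory stand-in for a cache keyed by query embeddings."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed value; would be tuned on real traffic
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, embedding):
        """Return the answer of the most similar cached query, or None on a miss."""
        best_answer, best_score = None, 0.0
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate of the stored query
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query
```

A hit skips retrieval and generation entirely, which is where the latency and cost savings come from. The threshold trades hit rate against the risk of returning an answer to a subtly different legal question.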

Key takeaways

Arabic NLP needs specialized preprocessing; generic tokenizers underperform.
Reformulate before retrieving; it is one of the highest-leverage RAG improvements.
Log everything from day one; it pays back in production.
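
As a small illustration of why Arabic needs specialized preprocessing, the sketch below applies light orthographic normalization so that spelling variants of the same word match in keyword search. It is not Farasa and is not taken from the project; the function name and the exact set of rules are assumptions, and a real pipeline would add segmentation or stemming on top.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) and the tatweel elongation mark (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below; all are folded to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text):
    """Strip diacritics and unify common letter variants (illustrative rules only)."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    return text


# The name "Ahmad" written with a hamza and a vowel mark, and written plainly.
vocalized = "\u0623\u064E\u062D\u0645\u062F"
plain = "\u0627\u062D\u0645\u062F"

same = normalize_arabic(vocalized) == normalize_arabic(plain)
```

Without this step a generic tokenizer treats the two spellings as different tokens, so a query in one form can miss an article written in the other.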