Arabic Legal Chatbot v3 (Case Study)

Overview

A production-grade Arabic-language legal Q&A chatbot built to make Libyan legal documents accessible through natural language. The system ingests structured legal corpora, retrieves relevant articles, and generates grounded answers with citations.

The problem

Libyan legal documents exist as dense Arabic text with no reliable way to query them. Arabic morphology makes naive keyword search unreliable, and hallucination is unacceptable in legal contexts.
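
To make the keyword-search problem concrete, here is a minimal sketch of surface-level Arabic normalization: stripping diacritics and tatweel and unifying alef variants. It is a simplified stand-in, not the Farasa pipeline the system uses, and the function name `normalize_arabic` is hypothetical.

```python
import re

# Short-vowel and related marks (fathatan..sukun) plus the dagger alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for justification
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below


def normalize_arabic(text: str) -> str:
    """Reduce common orthographic variation so equal words compare equal."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)  # map all variants to bare alef
    return text


# "al-qanun" (the law), once fully vocalized and once plain.
vocalized = "\u0627\u0644\u0652\u0642\u064E\u0627\u0646\u064F\u0648\u0646"
plain = "\u0627\u0644\u0642\u0627\u0646\u0648\u0646"

print(vocalized == plain)                    # raw strings differ
print(normalize_arabic(vocalized) == plain)  # equal after normalization
```

Normalization alone does not split attached prefixes such as the definite article or conjunctions, which is where a morphological segmenter comes in.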

Solution architecture

  • Query reformulation (LLM) to improve retrieval recall and handle follow-ups
  • Hybrid retrieval: embeddings + Arabic-aware BM25 (Farasa-tokenized)
  • Grounded answer generation with explicit no-answer handling
  • Semantic caching (PostgreSQL) to reduce latency and cost
  • Multi-turn conversational mode with streaming output
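
One common way to merge the embedding and BM25 result lists is reciprocal rank fusion (RRF). The post does not say which fusion method the system uses, so the sketch below is an assumption for illustration. The function name, the article IDs, and the constant `k=60` (a conventional default) are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the doc ID breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from each retriever.
bm25_hits = ["art_12", "art_7", "art_3"]
embedding_hits = ["art_7", "art_40", "art_12"]

fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
print(fused)
```

RRF uses only ranks, so it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.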

Development highlights

  • Replaced keyword-only retrieval with Farasa-assisted BM25 plus embeddings
  • Added timing and payload logging early to debug production issues quickly
  • Expanded corpus (including civil and maritime law) with a repeatable ingestion pipeline
  • Implemented conversational mode, voice input, and response classification
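
The timing and payload logging mentioned above could look something like the decorator below. This is a sketch under assumptions, not the project's actual logging code: the logger name, the `timed` decorator, and the stub `retrieve` function are hypothetical.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("chatbot.timing")  # hypothetical logger name


def timed(stage):
    """Log elapsed time and a truncated view of the arguments for one stage."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                record = {
                    "stage": stage,
                    "elapsed_ms": round(elapsed_ms, 2),
                    "payload": repr(args)[:200],  # truncate large payloads
                }
                logger.info(json.dumps(record, ensure_ascii=False))
        return wrapper
    return decorator


@timed("retrieve")
def retrieve(query):
    # Stand-in for the real retrieval step.
    return [f"stub result for {query}"]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    retrieve("example query")
```

Emitting one JSON object per stage keeps the logs easy to filter by stage name when tracking down a slow request.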

Results

  • Faster repeat queries via caching
  • Higher retrieval accuracy via hybrid search + reformulation
  • Grounded answers only, with guardrails for out-of-scope questions
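
The speedup on repeat queries comes from semantic caching: a new question reuses a stored answer when its embedding is close enough to a previous one. The sketch below shows the idea with an in-memory list standing in for the PostgreSQL store. The class name, the 0.95 threshold, and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory stand-in for a database-backed semantic cache."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query_embedding):
        """Return the answer of the most similar entry above the threshold."""
        best_answer, best_score = None, self.threshold
        for embedding, answer in self.entries:
            score = cosine(query_embedding, embedding)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer

    def put(self, query_embedding, answer):
        self.entries.append((list(query_embedding), answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer")

print(cache.get([0.99, 0.05]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0]))    # unrelated query: cache miss (None)
```

In a legal setting the threshold is worth setting conservatively, since a false hit returns an answer to a different question.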

Key takeaways

  • Arabic NLP needs specialized preprocessing; generic tokenizers underperform.
  • Reformulate before retrieving; it is one of the highest-leverage RAG improvements.
  • Log everything from day one; it pays back in production.