RAG Evaluation Pipeline
Overview
The RAG module includes an evaluator that can assess the full RAG pipeline, from context retrieval to the final LLM output.
The evaluation workflow consists of four main steps:
prepare a benchmark evaluation dataset in the required format
choose the metrics to evaluate
configure the evaluator, indexer, and RAG pipeline
run the evaluation for the selected retriever and LLM setup
MMORE relies on RAGAS for evaluation. RAGAS is a library designed for evaluating LLM applications.
TL;DR
The evaluator lets you measure both retrieval quality and answer quality in a single workflow.
In practice, this means:
loading an evaluation dataset
configuring metrics and models
building the evaluation index
running the RAG pipeline against benchmark queries
computing evaluation scores with RAGAS
See the available RAGAS metrics.
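The metric names used in the evaluator config correspond to RAGAS metric classes. As a quick sanity check that a metric is available in your installed RAGAS version (the exported set varies between releases), you can try importing it directly:

# Check that the metrics referenced in the evaluator config exist in your RAGAS install
from ragas.metrics import (
    LLMContextRecall,     # did the retrieved context cover the reference answer?
    Faithfulness,         # is the generated answer grounded in the retrieved context?
    FactualCorrectness,   # factual agreement between answer and reference
    SemanticSimilarity,   # embedding similarity between answer and reference
)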
Minimal Example
Here's a step-by-step guide to setting up the evaluation pipeline:
1. Create the evaluator config file
This file defines the evaluation settings for your pipeline.
hf_dataset_name: "Mallard74/eval_medical_benchmark" # Hugging Face Eval dataset name (Example dataset)
split: "train" # Dataset split
hf_feature_map: {'user_input': 'user_input', 'reference': 'reference', 'corpus': 'corpus', 'query_id': 'query_ids'} # Column mapping
metrics: # List of metrics to evaluate
  - LLMContextRecall
  - Faithfulness
  - FactualCorrectness
  - SemanticSimilarity
embeddings_name: "all-MiniLM-L6-v2" # Evaluator embedding model name
llm: # Evaluator LLM config
  llm_name: "gpt-4o"
  max_new_tokens: 150
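The hf_feature_map entry tells the evaluator which dataset columns to read as the query (user_input), reference answer, corpus, and query id. If you want to check that your own benchmark exposes the right columns before pointing the evaluator at it, a quick look with the datasets library is enough (this is only an inspection sketch, not the evaluator's loading code):

from datasets import load_dataset

# Load the example benchmark referenced above and inspect its columns
ds = load_dataset("Mallard74/eval_medical_benchmark", split="train")
print(ds.column_names)  # columns that hf_feature_map wires to the evaluator's fields
print(ds[0])            # one benchmark sample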
2. Create the indexer config file
This file configures the indexer during evaluation.
dense_model_name: sentence-transformers/all-MiniLM-L6-v2
sparse_model_name: splade
db:
  uri: "./examples/rag/milvus_mock_eval_medical_benchmark.db" # Dataset's Vectorstore URI
  name: "mock_eval_medical_benchmark"
chunker:
  chunking_strategy: sentence # Your chunking strategy
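The db.uri points to a local Milvus Lite database file that the indexer populates and the retriever later reads. After indexing, you can quickly confirm the database was written by opening the file with pymilvus (a sanity-check sketch only; the collection and partition layout is managed by MMORE):

from pymilvus import MilvusClient

# Open the local Milvus Lite file produced by the indexer
client = MilvusClient("./examples/rag/milvus_mock_eval_medical_benchmark.db")
print(client.list_collections())  # the evaluation collection should be listed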
3. Create the RAG pipeline config file
This file defines the RAG setup to evaluate.
llm:
  llm_name: "gpt-4o-mini" # RAG LLM model to evaluate
  max_new_tokens: 150
retriever:
  db:
    uri: "./examples/rag/milvus_mock_eval_medical_benchmark.db" # Dataset's Vectorstore URI
  hybrid_search_weight: 0.5
  k: 3
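hybrid_search_weight balances the dense (embedding) and sparse (SPLADE) retrieval scores, and k is the number of chunks returned to the LLM. A weight like this is commonly applied as a convex combination of the two scores; the snippet below only illustrates that idea and is not MMORE's actual fusion code:

def blend_scores(dense_score: float, sparse_score: float, weight: float = 0.5) -> float:
    # Assumed convention: weight = 1.0 -> dense only, weight = 0.0 -> sparse only
    return weight * dense_score + (1.0 - weight) * sparse_score

print(blend_scores(0.82, 0.40))  # about 0.61 with the default weight of 0.5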
4. Run the evaluation
Once the configuration files are in place, you can run the evaluation pipeline with the following Python script:
from mmore.rag.evaluator import RAGEvaluator

# Instantiate the evaluator from the config file created in step 1
evaluator = RAGEvaluator.from_config("eval_config.yaml")  # path to your evaluator config

# Run the evaluation with the indexer and RAG pipeline configs from steps 2 and 3
result = evaluator(
    indexer_config="indexer_config.yaml",  # path to your indexer config
    rag_config="rag_config.yaml",          # path to your RAG pipeline config
)
See examples/rag/evaluation for a simple example.
Warning
Create a separate database file for each evaluation dataset.
Within each database, the pipeline creates a partition per dense model for convenience.
Outputs
The evaluation run returns a result object containing the scores for each selected metric on the evaluated setup.
The exact structure depends on the evaluator configuration and the chosen metrics.
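If the result is the underlying RAGAS evaluation object (this depends on your MMORE version), it can be inspected directly, for example:

# Aggregated scores for the selected metrics
print(result)

# Per-sample scores, one row per benchmark query (RAGAS EvaluationResult API)
df = result.to_pandas()
print(df.head())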