📊 RAG Evaluation Pipeline

Overview

The RAG module includes an evaluator that can assess the full RAG pipeline, from context retrieval to the final LLM output.

The evaluation workflow consists of four main steps:

  1. prepare a benchmark evaluation dataset in the required format

  2. choose the metrics to evaluate

  3. configure the evaluator, indexer, and RAG pipeline

  4. run the evaluation for the selected retriever and LLM setup

MMORE relies on RAGAS, a library designed for evaluating LLM applications, to compute the evaluation scores.

💡 TL;DR

The evaluator lets you measure both retrieval quality and answer quality in a single workflow.

In practice, this means:

  • loading an evaluation dataset

  • configuring metrics and models

  • building the evaluation index

  • running the RAG pipeline against benchmark queries

  • computing evaluation scores with RAGAS

See the available RAGAS metrics.
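
The metric names listed under `metrics` in the evaluator config (step 1 below) map to RAGAS metric classes of the same name. As a quick sanity check that your RAGAS installation exposes them, you can import the classes directly; this is a minimal sketch assuming a recent RAGAS release, where metrics are exposed as classes:

from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    SemanticSimilarity,
)

# The same four metrics referenced in the evaluator config below
metrics = [LLMContextRecall(), Faithfulness(), FactualCorrectness(), SemanticSimilarity()]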

💻 Minimal Example

Here's a step-by-step guide to setting up the evaluation pipeline:

1. Create the evaluator config file

This file defines the evaluation settings for your pipeline.

hf_dataset_name: "Mallard74/eval_medical_benchmark"  # Hugging Face Eval dataset name (Example dataset)
split: "train"  # Dataset split
hf_feature_map:  # Column mapping
  user_input: user_input
  reference: reference
  corpus: corpus
  query_id: query_ids
metrics:  # List of metrics to evaluate
  - LLMContextRecall
  - Faithfulness
  - FactualCorrectness
  - SemanticSimilarity
embeddings_name: "all-MiniLM-L6-v2"  # Evaluator Embedding model name
llm:  # Evaluator LLM config
  llm_name: "gpt-4o"
  max_new_tokens: 150
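
Before running anything, it can help to confirm that the dataset's columns line up with `hf_feature_map`. A quick sketch using the Hugging Face datasets library (the expected column names are taken from the mapping above):

from datasets import load_dataset

# Pull the benchmark named in the evaluator config and list its columns;
# they should include the values referenced in hf_feature_map
ds = load_dataset("Mallard74/eval_medical_benchmark", split="train")
print(ds.column_names)  # expect: user_input, reference, corpus, query_ids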

2. Create the indexer config file

This file configures the indexer during evaluation.

dense_model_name: sentence-transformers/all-MiniLM-L6-v2
sparse_model_name: splade
db:
  uri: "./examples/rag/milvus_mock_eval_medical_benchmark.db"  # Dataset's Vectorstore URI
  name: "mock_eval_medical_benchmark"
chunker:
  chunking_strategy: sentence  # Your chunking strategy
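
Once an evaluation run has built the index, the `db.uri` above points at a local vector store. Assuming it is a Milvus Lite file (which the `.db` extension suggests), you can inspect it with pymilvus:

from pymilvus import MilvusClient

# Open the local Milvus Lite database; run this only after the index
# has been built, otherwise an empty file is created
client = MilvusClient("./examples/rag/milvus_mock_eval_medical_benchmark.db")

# List the collections and check for the one named in the indexer config
print(client.list_collections())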

3. Create the RAG pipeline config file

This file defines the RAG setup to evaluate.

llm:
  llm_name: "gpt-4o-mini"  # RAG LLM model to evaluate
  max_new_tokens: 150
retriever:
  db:
    uri: "./examples/rag/milvus_mock_eval_medical_benchmark.db"  # Dataset's Vectorstore URI
  hybrid_search_weight: 0.5
  k: 3
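
Here `hybrid_search_weight` blends the dense and sparse retrieval scores, and `k` is the number of chunks retrieved per query. As an illustration only (an assumption about the usual convention, not MMORE's actual code), a weight of 0.5 gives both retrievers equal influence:

# Hypothetical sketch of how a hybrid-search weight is commonly applied:
# weight = 1.0 -> purely dense, weight = 0.0 -> purely sparse
def hybrid_score(dense: float, sparse: float, weight: float = 0.5) -> float:
    return weight * dense + (1.0 - weight) * sparse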

4. Run the evaluation

Once the configuration files are in place, you can run the evaluation pipeline with the following Python script:

from mmore.rag.evaluator import RAGEvaluator

# Instantiate the evaluator from the config created in step 1
# (paths below are placeholders; point them at your own config files)
evaluator = RAGEvaluator.from_config("eval_config.yaml")

# Run the evaluation with the indexer config (step 2)
# and the RAG pipeline config (step 3)
result = evaluator(
    indexer_config="indexer_config.yaml",
    rag_config="rag_config.yaml",
)
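
If the returned object wraps RAGAS's EvaluationResult (an assumption; the exact type depends on the evaluator, see Outputs below), you can print the aggregate scores or expand them per sample:

# Aggregate scores for the selected metrics
print(result)

# Per-sample scores, assuming a RAGAS EvaluationResult
df = result.to_pandas()
print(df.head())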

Warning

Create a separate database file for each evaluation dataset; within it, the pipeline creates one partition per dense model for convenience.

📦 Outputs

The evaluation run returns a result object containing the scores for the selected metrics on the evaluated setup. Its exact structure depends on the evaluator configuration and the chosen metrics.

See also