Evaluation
The evaluation module provides tasks for measuring retrieval quality. Given a retriever and a test collection (topics + relevance assessments), it produces per-query and aggregate metric scores using the ir-measures library.
Evaluation tasks
These configurations define how a retrieval run is evaluated.
BaseEvaluation is the abstract base; Evaluate is the standard concrete implementation.
The Evaluate task automatically handles multi-GPU acceleration if the provided fabric_config specifies multiple devices. It shards the retrieval/re-ranking task across GPUs and merges the per-GPU results into a single run before computing metrics.
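The shard-then-merge behaviour can be pictured with a small, framework-free sketch. This is not the xpmir implementation: `shard_topics`, `merge_runs`, `toy_retrieve`, and `evaluate_sharded` are hypothetical helpers, and round-robin sharding is an assumption made purely for illustration.

```python
from typing import Callable, Dict, List

# A run maps each query id to {doc_id: score}
Run = Dict[str, Dict[str, float]]


def shard_topics(topic_ids: List[str], num_shards: int) -> List[List[str]]:
    """Split topics round-robin so that each worker gets a similar load."""
    return [topic_ids[i::num_shards] for i in range(num_shards)]


def merge_runs(partial_runs: List[Run]) -> Run:
    """Merge per-worker runs; shards are disjoint, so no query id may repeat."""
    merged: Run = {}
    for partial in partial_runs:
        for qid, scores in partial.items():
            if qid in merged:
                raise ValueError(f"query {qid} appears in two shards")
            merged[qid] = scores
    return merged


def toy_retrieve(qid: str) -> Dict[str, float]:
    """Hypothetical stand-in for a retriever call running on one device."""
    return {f"doc-{qid}-{rank}": 1.0 / (rank + 1) for rank in range(3)}


def evaluate_sharded(
    topic_ids: List[str],
    num_shards: int,
    retrieve: Callable[[str], Dict[str, float]],
) -> Run:
    shards = shard_topics(topic_ids, num_shards)
    # In a real multi-GPU setup, each shard would be processed on its own device
    partial_runs = [{qid: retrieve(qid) for qid in shard} for shard in shards]
    return merge_runs(partial_runs)


merged_run = evaluate_sharded(["q1", "q2", "q3", "q4", "q5"], 2, toy_retrieve)
```

Metrics are computed only once, on the merged run, so the scores do not depend on how the topics were distributed.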
- XPM Task xpmir.evaluation.BaseEvaluation(*, measures, with_run)
Bases: Task
Submit type: Any
Base class for evaluation tasks
- measures: List[xpmir.measures.Measure] = [Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure]]
List of metrics
- aggregated: path (generated)
Path for aggregated results
- detailed: path (generated)
Path for detailed results
- run_path: path (generated)
Path to save the run (TREC format). Only used if with_run is True
- XPM Task xpmir.evaluation.RunEvaluation(*, measures, with_run, run, assessments)
Bases: BaseEvaluation, Task
Submit type: Any
Evaluate a run
- measures: List[xpmir.measures.Measure] = [Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure]]
List of metrics
- aggregated: path (generated)
Path for aggregated results
- detailed: path (generated)
Path for detailed results
- run_path: path (generated)
Path to save the run (TREC format). Only used if with_run is True
- assessments: datamaestro_ir.data.AdhocAssessments
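For readers unfamiliar with the TREC run format mentioned under run_path, the sketch below serialises a run into its six whitespace-separated columns: query id, the literal `Q0`, document id, rank, score, and a run tag. The helper `format_trec_run` and the sample data are hypothetical and do not reflect how xpmir writes the file; 1-based ranks are an assumption of this sketch.

```python
from typing import Dict, List

# A run maps each query id to {doc_id: score}
Run = Dict[str, Dict[str, float]]


def format_trec_run(run: Run, tag: str = "run") -> List[str]:
    """Return one 'qid Q0 docid rank score tag' line per retrieved document."""
    lines = []
    for qid in sorted(run):
        # Highest score first; ties are broken by document id for determinism
        ranked = sorted(run[qid].items(), key=lambda item: (-item[1], item[0]))
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score} {tag}")
    return lines


example_run = {"q1": {"d3": 0.5, "d7": 2.0}, "q2": {"d1": 1.25}}
trec_lines = format_trec_run(example_run, tag="demo")
```

With the sample data, the first line is `q1 Q0 d7 1 2.0 demo`, since `d7` has the highest score for `q1`.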
- XPM Task xpmir.evaluation.Evaluate(*, measures, with_run, dataset, retriever, topic_wrapper)
Bases: BaseEvaluation, Task
Submit type: Any
Evaluate a retriever directly (without generating the run explicitly)
- measures: List[xpmir.measures.Measure] = [Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure], Config[xpmir.measures.measure]]
List of metrics
- aggregated: path (generated)
Path for aggregated results
- detailed: path (generated)
Path for detailed results
- run_path: path (generated)
Path to save the run (TREC format). Only used if with_run is True
- dataset: datamaestro_ir.data.Adhoc
The dataset for retrieval
- retriever: xpmir.rankers.retriever.Retriever
The retriever to evaluate
- topic_wrapper: datamaestro_ir.transforms.TopicWrapper
Topic extractor
- fabric_config: xpm_torch.configuration.FabricConfiguration (generated)
Runtime configuration, managed by Fabric
- class xpmir.evaluation.Evaluations(dataset: Adhoc, measures: List[Measure], *, topic_wrapper: TopicWrapper | None = None)
Bases: object
Holds experiment results for several models on one dataset
- class xpmir.evaluation.EvaluationsCollection(**collection: Evaluations)
Bases: object
A collection of evaluations
This is useful to group all the evaluations to be conducted, and then to call the evaluate_retriever() method.
- evaluate_retriever(retriever: Retriever | RetrieverFactory, launcher: Launcher = None, model_id: str | None = None, overwrite: bool = False, with_run: bool = False, init_tasks=[]) → list[EvaluationResult]
Evaluate a retriever for all the evaluations in this collection (the tasks are submitted to the experimaestro scheduler)
- Parameters:
with_run – whether the run should be preserved (default False). Note that this changes the experiment ID.
Metrics
Metrics are backed by the ir-measures library.
Cut-off values can be specified with the @ operator.
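In Python, syntax such as `nDCG@10` is typically supported by overloading the matrix-multiplication operator (`__matmul__`). The sketch below shows that general mechanism with a hypothetical `ToyMeasure` class; it is not the xpmir `Measure` class, and its fields are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyMeasure:
    """Hypothetical stand-in for a measure configuration (not the xpmir class)."""

    identifier: str
    cutoff: Optional[int] = None

    def __matmul__(self, cutoff: int) -> "ToyMeasure":
        # `measure @ k` returns a copy with the cutoff set; the original is unchanged
        return replace(self, cutoff=cutoff)

    def __str__(self) -> str:
        if self.cutoff is None:
            return self.identifier
        return f"{self.identifier}@{self.cutoff}"


nDCG = ToyMeasure("nDCG")
measures = [nDCG, nDCG @ 10, nDCG @ 20]
names = [str(m) for m in measures]
```

Returning a new object instead of mutating the measure means one base measure can be reused with several cut-offs in the same list.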
- XPM Config xpmir.measures.Measure(*, identifier, rel, cutoff)
Bases: Measure
Mirrors the ir_measures metric object
List of built-in measures:
- xpmir.measures.AP = Config[xpmir.measures.measure]
Average precision metric
- xpmir.measures.P = Config[xpmir.measures.measure]
Precision at rank
- xpmir.measures.R = Config[xpmir.measures.measure]
Recall at rank
- xpmir.measures.RR = Config[xpmir.measures.measure]
Reciprocal rank
- xpmir.measures.Success = Config[xpmir.measures.measure]
1 if a document with at least rel relevance is found in the first cutoff documents, else 0.
- xpmir.measures.nDCG = Config[xpmir.measures.measure]
Normalized Discounted Cumulated Gain
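To make the definitions above concrete, here is a plain-Python sketch of four of the measures for a single query. The scores in xpmir come from ir-measures, not from this code; the function names are hypothetical, and `rel` plays the role of the minimum relevance level. nDCG is left out because its gain and discount conventions vary between implementations.

```python
from typing import Dict, List


def precision_at(ranking: List[str], qrels: Dict[str, int], k: int, rel: int = 1) -> float:
    """Fraction of the top-k documents whose relevance is at least rel."""
    return sum(qrels.get(doc, 0) >= rel for doc in ranking[:k]) / k


def reciprocal_rank(ranking: List[str], qrels: Dict[str, int], rel: int = 1) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) >= rel:
            return 1.0 / rank
    return 0.0


def success_at(ranking: List[str], qrels: Dict[str, int], k: int, rel: int = 1) -> float:
    """1 if a relevant document appears in the top k, else 0."""
    return 1.0 if any(qrels.get(doc, 0) >= rel for doc in ranking[:k]) else 0.0


def average_precision(ranking: List[str], qrels: Dict[str, int], rel: int = 1) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    num_relevant = sum(1 for grade in qrels.values() if grade >= rel)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) >= rel:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute 0
    return total / num_relevant


# Toy data: d4 is relevant but never retrieved
ranking = ["d2", "d5", "d1", "d9"]
qrels = {"d1": 1, "d5": 2, "d4": 1}

p_at_2 = precision_at(ranking, qrels, k=2)
rr = reciprocal_rank(ranking, qrels)
s_at_1 = success_at(ranking, qrels, k=1)
ap = average_precision(ranking, qrels)
```

On this toy query the first relevant document sits at rank 2, so both P@2 and RR equal 0.5, while Success@1 is 0.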
Example:
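from xpmir.measures import AP, P, nDCG, RR
from xpmir.evaluation import Evaluate

# Cut-offs are attached with the @ operator
measures = [AP, P@20, nDCG, nDCG@10, nDCG@20, RR, RR@10]

# The remaining arguments (dataset, retriever, ...) are elided here
Evaluate(measures=measures, ...)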