Miscellaneous
Additional utility classes that support the main IR pipeline.
Data conversion
Converters transform between different data representations (e.g. converting retriever output formats).
ID lists
Configurations that represent ordered lists of document or topic IDs, used for filtering or subsetting collections.
Model export
Actions for exporting trained models (e.g. to HuggingFace Hub).
- XPM Config xpmir.models.XPMIRExportAction(*, loader, default_name, doc, bibtex)
Bases: ExportAction
Export action that uses XPMIRHFHub for xpmir-specific README sections.
- loader: xpm_torch.module.ModuleLoader
The model loader to export
Validation
Listeners that monitor model performance during training and control early-stopping or best-model checkpointing.
- XPM Config xpmir.letor.validation.ValidationListener(*, id, metrics, dataset, retriever, warmup, validation_interval, early_stop, hooks)
Bases: LearnerListener
Validation during learning, with early stopping
Computes a validation metric and stores the best result. If early_stop is set (> 0), it signals to the learner that training can stop once that many epochs have passed without improvement.
- metrics: Dict[str, bool] = {'map': True}
Dictionary whose keys are the metrics to record (names [parseable by ir-measures](https://ir-measur.es/)) and whose boolean values indicate whether the best-performing checkpoint should be kept for the associated metric
- dataset: datamaestro_ir.data.Adhoc
The dataset to use
- retriever: xpmir.rankers.retriever.Retriever
The retriever for validation
- bestpath: path (generated)
Path to the best checkpoints
- info: path (generated)
Path to the JSON file that contains the metric values at each epoch
- early_stop: int = 0
Number of epochs without improvement after which we stop learning. Should be a multiple of validation_interval or 0 (no early stopping)
- hooks: List[xpm_torch.trainers.context.ValidationHook] = []
The list of the hooks during the validation
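The early-stopping rule described for this listener can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the xpmir implementation: `EarlyStopTracker` and its `update` method are hypothetical names, and the sketch assumes that a higher metric value is better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopTracker:
    """Hypothetical helper: tracks the best metric value and decides when to stop."""

    early_stop: int = 0  # 0 disables early stopping
    best_value: Optional[float] = None
    best_epoch: int = 0

    def update(self, epoch: int, value: float) -> bool:
        """Record one validation result; return True if training can stop."""
        if self.best_value is None or value > self.best_value:
            # New best result: remember it and keep training
            self.best_value = value
            self.best_epoch = epoch
            return False
        if self.early_stop <= 0:
            return False
        # Stop once early_stop epochs have passed without improvement
        return epoch - self.best_epoch >= self.early_stop


tracker = EarlyStopTracker(early_stop=4)
# Validation only runs every 2 epochs (validation_interval = 2 in this example)
history = [(2, 0.30), (4, 0.35), (6, 0.34), (8, 0.33), (10, 0.36)]
decisions = [tracker.update(epoch, value) for epoch, value in history]
```

Because the check only happens at validation epochs, an early_stop value that is not a multiple of validation_interval would in this sketch only take effect at the next validation epoch, which is one way to read the recommendation on early_stop.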
- XPM Config xpmir.letor.validation.AggregatorValidationListener(*, id, listeners, metrics, warmup, validation_interval, early_stop, hooks)
Bases: LearnerListener
Aggregates multiple validation listeners
Stops when all the listeners agree to stop.
- listeners: List[xpmir.letor.validation.ValidationListener]
The list of validation listeners to aggregate
- metrics: Dict[str, bool] = {'map': True}
Dictionary whose keys are the metrics to record (names [parseable by ir-measures](https://ir-measur.es/)) and whose boolean values indicate whether the best-performing checkpoint should be kept for the associated metric
- bestpath: path (generated)
Path to the best checkpoints
- info: path (generated)
Path to the JSON file that contains the metric values at each epoch
- early_stop: int = 0
Number of epochs without improvement after which we stop learning. Should be a multiple of validation_interval or 0 (no early stopping)
- hooks: List[xpm_torch.trainers.context.ValidationHook] = []
The list of the hooks during the validation
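The "stops when all the listeners agree" rule amounts to a logical AND over the individual decisions. The sketch below is only an illustration: `aggregate_stop` is a hypothetical function, and treating an empty set of listeners as "do not stop" is an assumption.

```python
from typing import Dict


def aggregate_stop(decisions: Dict[str, bool]) -> bool:
    """Hypothetical helper: stop only if every listener votes to stop.

    `decisions` maps a listener identifier to its own stop decision.
    An empty mapping never triggers a stop (assumption).
    """
    return bool(decisions) and all(decisions.values())


one_disagrees = aggregate_stop({"msmarco": True, "trec-dl": False})
all_agree = aggregate_stop({"msmarco": True, "trec-dl": True})
```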
- XPM Config xpmir.letor.validation.ValidationSettings(*, listener, key)
Bases: Config
Settings for a validation-specific ModuleLoader.
Attached as settings on the loader to distinguish validation checkpoints from other loaders with the same model and path.
- listener: xpm_torch.learner.LearnerListener
The listener (kept to change the loader identifier based on the learner listener configuration)
Processors
Pre- and post-processing transforms applied to documents, queries, or records before scoring.
- XPM Config xpmir.letor.processors.DocumentsProcessor
Bases: RecordsProcessor[DocIn, QueryIn, DocOut, QueryIn], Generic[DocIn, QueryIn, DocOut]
Extracts documents from samples, processes them in batch, and puts them back.
Queries are unchanged (QueryIn → QueryIn).
- XPM Config xpmir.letor.processors.QueriesProcessor
Bases: RecordsProcessor[DocIn, QueryIn, DocIn, QueryOut], Generic[DocIn, QueryIn, QueryOut]
Extracts queries from samples, processes them in batch, and puts them back.
Documents are unchanged (DocIn → DocIn).
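Both processors share the same extract, batch-process, put-back pattern. The sketch below shows it for documents, with queries left untouched. The `Sample` record and the `process_documents` function are hypothetical and do not reflect the actual RecordsProcessor interface.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical record pairing a query with a document."""

    query: str
    document: str


def process_documents(
    samples: List[Sample], batch_fn: Callable[[List[str]], List[str]]
) -> List[Sample]:
    """Extract documents, process them in one batch, and put them back."""
    documents = [sample.document for sample in samples]
    processed = batch_fn(documents)
    if len(processed) != len(samples):
        raise ValueError("batch_fn must return one output per input document")
    # Queries are copied unchanged; only the document field is replaced
    return [replace(s, document=d) for s, d in zip(samples, processed)]


samples = [Sample("q1", "Hello World"), Sample("q2", "Foo Bar")]
lowered = process_documents(samples, lambda docs: [d.lower() for d in docs])
```

Processing all documents in a single call is what lets a real processor amortize expensive work, such as tokenization, over the whole batch.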
Listwise distillation
Listwise distillation losses and trainers (see also Knowledge distillation for pairwise distillation).
- XPM Config xpmir.letor.distillation.listwise.DistillationListwiseLoss(*, weight)
The abstract loss for listwise distillation
- XPM Config xpmir.letor.distillation.listwise.DistillationListwiseTrainer(*, hooks, model, sampler, batch_size, num_workers, lossfn)
Bases: LossTrainer
Listwise trainer for distillation
- hooks: List[xpm_torch.trainers.context.TrainingHook] = []
Hooks for this trainer: this includes the losses, but can be adapted for other uses. The specific list of hooks depends on the specific trainer.
- model: xpm_torch.module.Module
If the model to optimize is different from the model passed to Learn, this parameter can be used; initialization is still expected to be done at the learner level
- batcher: xpm_torch.batchers.Batcher (generated)
How to batch samples together
- sampler: xpm_torch.base.Sampler
The sampler to use
- lossfn: xpmir.letor.distillation.listwise.DistillationListwiseLoss
The listwise distillation loss function
- XPM Config xpmir.letor.distillation.listwise.ListwiseSoftmaxCrossEntropy(*, weight)
Bases: DistillationListwiseLoss
Reproduces the original SoftmaxCrossEntropy behavior used in batchwise losses, adapted to listwise distillation.
The original formula is:
-logsumexp(normalize(scores) + (1 - 1.0 / relevances), dim=-1).mean()
where normalize depends on the model output type.
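The formula can be traced in plain Python to see what the `1 - 1.0 / relevances` term does. This is a sketch under assumptions, not the library code: it takes `normalize` to be a log-softmax (the source only says it depends on the model output type), and all function names are hypothetical. With binary labels, the offset is 0 for relevant items and -inf for non-relevant ones, so the loss reduces to the negative log of the probability mass assigned to relevant items.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Assumed choice for `normalize` in this sketch."""
    norm = logsumexp(scores)
    return [s - norm for s in scores]


def relevance_offset(rel: float) -> float:
    """1 - 1/rel: 0 for rel = 1, -inf for rel = 0 (masks non-relevant items)."""
    return 1.0 - 1.0 / rel if rel > 0 else -math.inf


def listwise_softmax_ce(
    batch_scores: Sequence[Sequence[float]],
    batch_relevances: Sequence[Sequence[float]],
) -> float:
    """Mean over lists of -logsumexp(normalize(scores) + (1 - 1/relevances))."""
    losses = []
    for scores, rels in zip(batch_scores, batch_relevances):
        shifted = [
            lp + relevance_offset(r) for lp, r in zip(log_softmax(scores), rels)
        ]
        losses.append(-logsumexp(shifted))
    return sum(losses) / len(losses)


# First list: one relevant item out of two equally scored items.
# Second list: every item is relevant, so its loss is zero.
loss = listwise_softmax_ce([[0.0, 0.0], [2.0, 1.0]], [[1, 0], [1, 1]])
```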
- XPM Config xpmir.letor.distillation.listwise.DistillRankNetLoss(*, weight)
Bases: DistillationListwiseLoss
Adaptation of the pairwise RankNet loss to lists of passages ranked by an LLM. Follows Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking, 2025
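As a rough illustration of how a pairwise RankNet loss extends to a ranked list, the sketch below applies the standard RankNet logistic loss to every pair in which the teacher ranks one passage above another. The function name is hypothetical, and averaging over pairs is an assumption. The reduction and any weighting used by xpmir or by the Rank-DistiLLM paper may differ.

```python
import math
from typing import Sequence


def ranknet_list_loss(scores: Sequence[float]) -> float:
    """RankNet loss over a teacher-ranked list (hypothetical sketch).

    scores[i] is the student score of the passage that the teacher (LLM)
    ranked at position i, with position 0 being the best.
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # Passage i should outscore passage j: penalize log(1 + exp(s_j - s_i)),
            # computed in a numerically stable way
            diff = scores[j] - scores[i]
            total += max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
            pairs += 1
    # Averaging over pairs is an assumption of this sketch
    return total / pairs if pairs else 0.0


agreeing = ranknet_list_loss([3.0, 2.0, 1.0])  # student agrees with the teacher
reversed_order = ranknet_list_loss([1.0, 2.0, 3.0])  # student order is reversed
```

The loss is lower when the student scores follow the teacher order, and a pair of tied scores contributes log 2.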
- XPM Config xpmir.letor.distillation.listwise.ADR_MSE(*, weight)
Bases: DistillationListwiseLoss
New loss to distill from lists of passages ranked by an LLM, proposed by Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking, 2025
- XPM Config xpmir.letor.distillation.samplers.DistillationListwiseSampler(*, samples)
Bases: Sampler
Just loops over samples
- XPM Config xpmir.letor.distillation.samplers.DistillationNegativesSampler(*, samples, passages_per_query)
Bases: DistillationListwiseSampler
Samples only passages_per_query documents per query.
Skips queries that have no relevant document in the retrieved set.
Needs relevance judgements to ensure sampling one positive and (passages_per_query - 1) negatives per query.
Uses ScoredDocument to store relevance labels. Note: ignores any scores from the original dataset.
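The sampling rule (one positive plus passages_per_query - 1 negatives, skipping queries without a relevant document) can be sketched as follows. The `sample_lists` function and the `(doc_id, relevance)` tuples are hypothetical stand-ins for the real sampler and for ScoredDocument. Skipping queries that have too few negatives is an assumption, since the source does not say how that case is handled.

```python
import random
from typing import Dict, Iterator, List, Tuple

Candidate = Tuple[str, int]  # (doc_id, relevance label), a hypothetical stand-in


def sample_lists(
    retrieved: Dict[str, List[Candidate]],
    passages_per_query: int,
    rng: random.Random,
) -> Iterator[Tuple[str, List[Candidate]]]:
    """Yield (query_id, [positive, negative, ...]) with passages_per_query items."""
    for query_id, candidates in retrieved.items():
        positives = [c for c in candidates if c[1] > 0]
        negatives = [c for c in candidates if c[1] <= 0]
        if not positives:
            continue  # no relevant document in the retrieved set
        if len(negatives) < passages_per_query - 1:
            continue  # assumption: skip queries with too few negatives
        chosen = [rng.choice(positives)]
        chosen += rng.sample(negatives, passages_per_query - 1)
        yield query_id, chosen


retrieved = {
    "q1": [("d1", 1), ("d2", 0), ("d3", 0), ("d4", 0)],
    "q2": [("d5", 0), ("d6", 0), ("d7", 0)],  # no positive: skipped
}
sampled = dict(sample_lists(retrieved, passages_per_query=3, rng=random.Random(0)))
```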
Index utilities
Bag-of-words retrieval and sparse-to-BMP format conversion.
- XPM Config xpmir.index.bow.BOWRetriever(*, store, index, model, topk, in_memory)
Bases: Retriever
BM25 retriever using the impact_index BOW index
This mirrors the AnseriniRetriever but uses the impact_index library for BM25 scoring instead of Lucene/pyserini.
- store: datamaestro_ir.data.DocumentStore
The document store associated with this retriever
- index: xpmir.index.bow.BOWSparseRetrieverIndex
The BOW index
- model: xpmir.rankers.standard.Model
The scoring model (e.g. BM25)
- XPM Config xpmir.index.bow.BOWSparseRetrieverIndex(*, documents, index_path)
Bases: Config
A bag-of-words index with BM25 scoring
Uses impact_index.BOWIndexBuilder for text-based tokenization and BM25 scoring at retrieval time.
- documents: datamaestro_ir.data.DocumentStore
The indexed document collection
- index_path: path
Path to the index directory
- XPM Task xpmir.index.bow.BOWSparseRetrieverIndexBuilder(*, documents, stemmer, language, stop_words, batch_size, max_docs, in_memory_threshold, compress)
Bases: Task
Submit type: Any
Builds a bag-of-words index from document text
Uses impact_index.BOWIndexBuilder to tokenize documents and store term frequencies + document lengths for BM25 scoring.
Defaults match Lucene/Pyserini’s EnglishAnalyzer pipeline:
- Porter stemmer (original, not Snowball/Porter2)
- English stop words (33-word Lucene default)
- UAX#29 tokenization with English possessive filter
- Block size 128 for effective block-max pruning
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index
- index_path: path (generated)
Path to store the index
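To illustrate why storing term frequencies and document lengths is sufficient for BM25 scoring at retrieval time, here is a small self-contained sketch. It does not use the impact_index API. The function names are hypothetical, tokenization is a plain lowercase whitespace split (no stemming or stop words), and the k1 and b values are illustrative defaults rather than the ones used by xpmir.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def build_index(docs: Dict[str, str]) -> Tuple[Dict[str, Counter], Dict[str, int]]:
    """Store per-document term frequencies and document lengths."""
    tfs = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    lengths = {doc_id: sum(tf.values()) for doc_id, tf in tfs.items()}
    return tfs, lengths


def bm25(
    query: str,
    tfs: Dict[str, Counter],
    lengths: Dict[str, int],
    k1: float = 1.2,  # illustrative default
    b: float = 0.75,  # illustrative default
) -> Dict[str, float]:
    """Classic BM25 computed only from term frequencies and document lengths."""
    n_docs = len(tfs)
    avg_len = sum(lengths.values()) / n_docs
    scores: Dict[str, float] = {}
    for term in query.lower().split():
        df = sum(1 for tf in tfs.values() if term in tf)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in tfs.items():
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            # Length normalization: longer documents need more occurrences
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * freq * (k1 + 1) / (freq + norm)
    return scores


docs = {
    "d1": "neural ranking models",
    "d2": "cooking pasta at home",
    "d3": "ranking ranking pasta",
}
tfs, lengths = build_index(docs)
scores = bm25("neural ranking", tfs, lengths)
best = max(scores, key=scores.get)
```

In this toy collection only d1 and d3 contain a query term, and d1 ranks first because it matches both terms, including the rarer one.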