Miscellaneous

Additional utility classes that support the main IR pipeline.

Data conversion

Converters transform between different data representations (e.g. converting retriever output formats).

XPM Config xpmir.utils.convert.Converter

Bases: Config, ABC, Generic[Input, Output]

ID lists

Configurations that represent ordered lists of document or topic IDs, used for filtering or subsetting collections.

XPM Config xpmir.misc.IDList

Bases: Config, ABC

A configuration that returns a list of IDs

XPM Config xpmir.misc.FileIDList(*, path)

Bases: IDList

A file-based list of IDs

path: path
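As a minimal sketch of what a file-based ID list amounts to, the function below reads IDs from a text file. The one-ID-per-line layout and the name `read_id_list` are assumptions made for illustration; the file format actually expected by `FileIDList` is not stated here.

```python
from pathlib import Path
from typing import List, Union


def read_id_list(path: Union[str, Path]) -> List[str]:
    """Return the IDs stored in `path`, preserving file order.

    Assumption: one ID per line; blank lines are skipped.
    """
    with Path(path).open("rt", encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]
```

Keeping the file order matters because the ID lists are described as ordered.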

Model export

Actions for exporting trained models (e.g. to HuggingFace Hub).

XPM Config xpmir.models.XPMIRExportAction(*, loader, default_name, doc, bibtex)

Bases: ExportAction

Export action that uses XPMIRHFHub for xpmir-specific README sections.

loader: xpm_torch.module.ModuleLoader

The model loader to export

default_name: str

Default HF Hub model name (for pre-fill)

doc: str

Paper description or title

bibtex: str

BibTeX citation

Validation

Listeners that monitor model performance during training and control early-stopping or best-model checkpointing.

XPM Config xpmir.letor.validation.ValidationListener(*, id, metrics, dataset, retriever, warmup, validation_interval, early_stop, hooks)

Bases: LearnerListener

Learning validation early-stopping

Computes a validation metric and stores the best result. If early_stop is set (> 0), it signals to the learner that training can stop once the metric has not improved for that many epochs.

id: str

Unique ID to identify the listener (ignored for signature)

metrics: Dict[str, bool] = {'map': True}

Dictionary whose keys are the metrics to record (parseable by ir-measures, https://ir-measur.es/) and whose boolean values indicate whether the best-performance checkpoint should be kept for the associated metric

dataset: datamaestro_ir.data.Adhoc

The dataset to use

retriever: xpmir.rankers.retriever.Retriever

The retriever for validation

warmup: int = -1

How many epochs before actually computing the metric

bestpath: path (generated)

Path to the best checkpoints

info: path (generated)

Path to the JSON file that contains the metric values at each epoch

validation_interval: int = 1

Epochs between each validation

early_stop: int = 0

Number of epochs without improvement after which we stop learning. Should be a multiple of validation_interval or 0 (no early stopping)

hooks: List[xpm_torch.trainers.context.ValidationHook] = []

The list of hooks to run during validation
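The interaction between warmup, validation_interval and early_stop can be sketched as follows. `EarlyStopTracker` is a hypothetical helper, not part of xpmir; it assumes that higher metric values are better and that validation runs on epochs after the warmup that are multiples of validation_interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopTracker:
    """Hypothetical helper mirroring the listener's parameters."""

    warmup: int = -1
    validation_interval: int = 1
    early_stop: int = 0  # 0 disables early stopping
    best_value: Optional[float] = None
    best_epoch: int = 0

    def should_validate(self, epoch: int) -> bool:
        # Assumption: validate after the warmup, every `validation_interval` epochs
        return epoch > self.warmup and epoch % self.validation_interval == 0

    def update(self, epoch: int, value: float) -> bool:
        """Record a metric value and return True if training can stop."""
        if self.best_value is None or value > self.best_value:
            # New best result: this is where a checkpoint would be saved
            self.best_value, self.best_epoch = value, epoch
            return False
        # Stop after `early_stop` epochs without improvement
        return self.early_stop > 0 and epoch - self.best_epoch >= self.early_stop


tracker = EarlyStopTracker(validation_interval=2, early_stop=4)
history = {2: 0.30, 4: 0.35, 6: 0.34, 8: 0.33, 10: 0.36}  # made-up metric values
stopped_at = None
for epoch in range(1, 11):
    if tracker.should_validate(epoch) and tracker.update(epoch, history[epoch]):
        stopped_at = epoch
        break
```

In this sketch the patience is counted in epochs, which is why early_stop is best chosen as a multiple of validation_interval: otherwise the stop can only trigger at the next validation epoch after the threshold is crossed.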

XPM Config xpmir.letor.validation.AggregatorValidationListener(*, id, listeners, metrics, warmup, validation_interval, early_stop, hooks)

Bases: LearnerListener

Aggregates multiple validation listeners

Stops when all the listeners agree to stop.

id: str

Unique ID to identify the listener (ignored for signature)

listeners: List[xpmir.letor.validation.ValidationListener]

The list of validation listeners to aggregate

metrics: Dict[str, bool] = {'map': True}

Dictionary whose keys are the metrics to record (parseable by ir-measures, https://ir-measur.es/) and whose boolean values indicate whether the best-performance checkpoint should be kept for the associated metric

warmup: int = -1

How many epochs before actually computing the metric

bestpath: path (generated)

Path to the best checkpoints

info: path (generated)

Path to the JSON file that contains the metric values at each epoch

validation_interval: int = 1

Epochs between each validation

early_stop: int = 0

Number of epochs without improvement after which we stop learning. Should be a multiple of validation_interval or 0 (no early stopping)

hooks: List[xpm_torch.trainers.context.ValidationHook] = []

The list of hooks to run during validation

XPM Config xpmir.letor.validation.ValidationSettings(*, listener, key)

Bases: Config

Settings for a validation-specific ModuleLoader.

Attached as settings on the loader to distinguish validation checkpoints from other loaders with the same model and path.

listener: xpm_torch.learner.LearnerListener

The listener (kept to change the loader identifier based on the learner listener configuration)

key: str

The metric key for this validation checkpoint

Processors

Pre- and post-processing transforms applied to documents, queries, or records before scoring.

XPM Config xpmir.letor.processors.DocumentsProcessor

Bases: RecordsProcessor[DocIn, QueryIn, DocOut, QueryIn], Generic[DocIn, QueryIn, DocOut]

Extracts documents from samples, processes them in batch, puts them back.

Queries are unchanged (QueryIn → QueryIn).

XPM Config xpmir.letor.processors.QueriesProcessor

Bases: RecordsProcessor[DocIn, QueryIn, DocIn, QueryOut], Generic[DocIn, QueryIn, QueryOut]

Extracts queries from samples, processes them in batch, puts them back.

Documents are unchanged (DocIn → DocIn).

XPM Config xpmir.letor.processors.RecordsProcessor

Bases: Config, ABC, Generic[DocIn, QueryIn, DocOut, QueryOut]

Processes a batch of SampleItem[DocIn, QueryIn] into SampleItem[DocOut, QueryOut].
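The extract / process-in-batch / put-back pattern described for DocumentsProcessor can be illustrated with a small stand-alone sketch. `Sample` and `process_documents` are hypothetical names; in particular, the assumption that a sample holds one query and a list of documents is made for the example and may not match the real SampleItem layout.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Sequence, TypeVar

DocIn = TypeVar("DocIn")
DocOut = TypeVar("DocOut")
QueryIn = TypeVar("QueryIn")


@dataclass(frozen=True)
class Sample(Generic[DocIn, QueryIn]):
    """Hypothetical stand-in for SampleItem: one query with its documents."""

    query: QueryIn
    documents: List[DocIn]


def process_documents(
    samples: Sequence[Sample[DocIn, QueryIn]],
    batch_fn: Callable[[List[DocIn]], List[DocOut]],
) -> List[Sample[DocOut, QueryIn]]:
    # 1. Extract: flatten the documents of all samples into a single batch
    flat = [doc for sample in samples for doc in sample.documents]
    # 2. Process the whole batch with one call
    processed = batch_fn(flat)
    # 3. Put back: slice the processed batch using the original lengths
    output, start = [], 0
    for sample in samples:
        end = start + len(sample.documents)
        output.append(Sample(query=sample.query, documents=processed[start:end]))
        start = end
    return output


samples = [Sample("q1", ["a", "b"]), Sample("q2", ["c"])]
uppercased = process_documents(samples, lambda docs: [d.upper() for d in docs])
```

Processing one flat batch instead of one call per sample is what makes this pattern useful when `batch_fn` is expensive to invoke (for instance a tokenizer or an encoder). A queries processor would follow the same three steps on the query field and leave the documents untouched.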

Listwise distillation

Listwise distillation losses and trainers (see also Knowledge distillation for pairwise distillation).

XPM Config xpmir.letor.distillation.listwise.DistillationListwiseLoss(*, weight)

Bases: Config, Module

The abstract loss for listwise distillation

weight: float = 1.0

XPM Config xpmir.letor.distillation.listwise.DistillationListwiseTrainer(*, hooks, model, sampler, batch_size, num_workers, lossfn)

Bases: LossTrainer

Listwise trainer for distillation

hooks: List[xpm_torch.trainers.context.TrainingHook] = []

Hooks for this trainer: this includes the losses, but can be adapted for other uses. The specific list of hooks depends on the specific trainer.

model: xpm_torch.module.Module

If the model to optimize is different from the model passed to Learn, this parameter can be used; initialization is still expected to be done at the learner level

batcher: xpm_torch.batchers.Batcher (generated)

How to batch samples together

sampler: xpm_torch.base.Sampler

The sampler to use

batch_size: int = 16

Number of samples per batch

num_workers: int = 2

Number of DataLoader workers

lossfn: xpmir.letor.distillation.listwise.DistillationListwiseLoss

The listwise distillation loss function

XPM Config xpmir.letor.distillation.listwise.ListwiseSoftmaxCrossEntropy(*, weight)

Bases: DistillationListwiseLoss

Reproduces the original SoftmaxCrossEntropy behavior used in batchwise losses, adapted to listwise distillation.

The original formula is:

-logsumexp(normalize(scores) + (1 - 1.0 / relevances), dim=-1).mean()

where normalize depends on the model output type.
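To make the formula concrete, here is a dependency-free sketch that evaluates it on plain Python lists. It assumes that normalize is a log-softmax over each list, which is only one of the possible cases, and it treats a relevance of 0 as an offset of -inf (what 1 - 1.0 / 0 evaluates to with floating-point tensors), so non-relevant passages are masked out of the sum. The function names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def logsumexp(values: Sequence[float]) -> float:
    """Stable log-sum-exp that tolerates -inf entries."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))


def listwise_softmax_ce(
    scores: Sequence[Sequence[float]], relevances: Sequence[Sequence[float]]
) -> float:
    """Mean over lists of -logsumexp(normalize(scores) + (1 - 1 / relevances))."""
    losses = []
    for row_scores, row_rels in zip(scores, relevances):
        normalized = log_softmax(row_scores)  # assumption: normalize = log-softmax
        # Offset is 0 for relevance 1 and -inf for relevance 0 (passage masked)
        offsets = [1.0 - 1.0 / r if r > 0 else -math.inf for r in row_rels]
        losses.append(-logsumexp([n + o for n, o in zip(normalized, offsets)]))
    return sum(losses) / len(losses)


# One list of three passages where only the first one is relevant
loss = listwise_softmax_ce([[2.0, 1.0, 0.0]], [[1, 0, 0]])
```

With binary relevance and a log-softmax normalization, the loss reduces to the negative log of the total probability mass assigned to the relevant passages.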

weight: float = 1.0

XPM Config xpmir.letor.distillation.listwise.DistillRankNetLoss(*, weight)

Bases: DistillationListwiseLoss

Adaptation of the pairwise RankNet loss to lists of passages ranked by an LLM. Follows Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking, 2025
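As an illustration of how a pairwise RankNet objective extends to a ranked list, the sketch below applies the standard RankNet term log(1 + exp(s_j - s_i)) to every pair where the teacher ranks passage i above passage j. Averaging over pairs and the function name are assumptions for this example; the exact reduction used by DistillRankNetLoss is not specified here.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranknet_list_loss(student_scores: Sequence[float]) -> float:
    """RankNet loss over all ordered pairs of a teacher-ranked list.

    `student_scores` must be given in the teacher's ranking order (best first).
    """
    total, pairs = 0.0, 0
    for i in range(len(student_scores)):
        for j in range(i + 1, len(student_scores)):
            # The teacher prefers i over j: penalize when s_i is not above s_j
            total += softplus(student_scores[j] - student_scores[i])
            pairs += 1
    return total / pairs if pairs else 0.0


agreeing = ranknet_list_loss([2.0, 1.0, 0.0])  # student agrees with the teacher
reversed_order = ranknet_list_loss([0.0, 1.0, 2.0])  # student ranks in reverse
```

Only the teacher's ordering is used, not its scores, which suits LLM rerankers that output a permutation rather than calibrated relevance scores.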

weight: float = 1.0

XPM Config xpmir.letor.distillation.listwise.ADR_MSE(*, weight)

Bases: DistillationListwiseLoss

New loss to distill from lists of passages ranked by an LLM, proposed by Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking, 2025

weight: float = 1.0

XPM Config xpmir.letor.distillation.samplers.DistillationListwiseSampler(*, samples)

Bases: Sampler

Just loops over samples

samples: datamaestro_ir.data.distillation.ListwiseDistillationSamples

XPM Config xpmir.letor.distillation.samplers.DistillationNegativesSampler(*, samples, passages_per_query)

Bases: DistillationListwiseSampler

Samples only passages_per_query documents per query.

Skips queries that have no relevant document in the retrieved set.

  • Needs relevance judgements to ensure sampling one positive and (passages_per_query - 1) negatives per query.

  • Uses ScoredDocument to store relevance labels. Note: ignores any scores from the original dataset.
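The sampling rule above (one positive, passages_per_query - 1 negatives, queries without a relevant passage skipped) can be sketched as follows. The data layout, the function name, and the choice to also skip queries with too few negatives are assumptions for the example, not the sampler's documented behavior.

```python
import random
from typing import Dict, Iterator, List, Set, Tuple


def sample_passages(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    passages_per_query: int,
    rng: random.Random,
) -> Iterator[Tuple[str, List[Tuple[str, int]]]]:
    """Yield (query ID, [(doc ID, relevance label), ...]) with the positive first."""
    for qid, doc_ids in retrieved.items():
        qrels = relevant.get(qid, set())
        positives = [d for d in doc_ids if d in qrels]
        negatives = [d for d in doc_ids if d not in qrels]
        if not positives:
            continue  # no relevant passage in the retrieved set: skip the query
        if len(negatives) < passages_per_query - 1:
            continue  # assumption: also skip when there are too few negatives
        chosen = [(rng.choice(positives), 1)]
        chosen += [(d, 0) for d in rng.sample(negatives, passages_per_query - 1)]
        yield qid, chosen


retrieved = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
relevant = {"q1": {"d2"}}  # q2 has no relevant passage and is skipped
sampled = list(sample_passages(retrieved, relevant, 3, random.Random(0)))
```

The labels attached to each passage are relevance judgements only, consistent with the note that scores from the original dataset are ignored.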

samples: datamaestro_ir.data.distillation.ListwiseDistillationSamplesTSVWithAnnotations

passages_per_query: int = 8

Index utilities

Bag-of-words retrieval and sparse-to-BMP format conversion.

XPM Config xpmir.index.bow.BOWRetriever(*, store, index, model, topk, in_memory)

Bases: Retriever

BM25 retriever using the impact_index BOW index

This mirrors the AnseriniRetriever but uses the impact_index library for BM25 scoring instead of Lucene/pyserini.

store: datamaestro_ir.data.DocumentStore

The document store associated with this retriever

index: xpmir.index.bow.BOWSparseRetrieverIndex

The BOW index

model: xpmir.rankers.standard.Model

The scoring model (e.g. BM25)

topk: int

Number of documents to return

in_memory: bool = False

Whether the index should be fully loaded in memory

XPM Config xpmir.index.bow.BOWSparseRetrieverIndex(*, documents, index_path)

Bases: Config

A bag-of-words index with BM25 scoring

Uses impact_index.BOWIndexBuilder for text-based tokenization and BM25 scoring at retrieval time.

documents: datamaestro_ir.data.DocumentStore

The indexed document collection

index_path: path

Path to the index directory
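To show why term frequencies and document lengths are enough to score at retrieval time, here is a sketch of a textbook BM25 scoring function. It does not use impact_index; the function name, the k1 and b defaults, and the exact IDF variant are illustrative assumptions and may differ from what the library computes.

```python
import math
from typing import Dict, Sequence


def bm25_score(
    query_terms: Sequence[str],
    doc_tf: Dict[str, int],
    doc_len: int,
    avg_doc_len: float,
    doc_freq: Dict[str, int],
    num_docs: int,
    k1: float = 0.9,
    b: float = 0.4,
) -> float:
    """Score one document for a query from index statistics only."""
    # Length normalization shared by all terms of the document
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # the term does not occur in this document
        df = doc_freq[term]
        # Non-negative IDF variant
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score


# Toy statistics: 10 documents, "neural" occurs in exactly one of them
score = bm25_score(
    ["neural", "ranking"],
    doc_tf={"neural": 1},
    doc_len=100,
    avg_doc_len=100.0,
    doc_freq={"neural": 1, "ranking": 5},
    num_docs=10,
)
```

Because the scoring model is applied at query time, the same index can in principle be reused with different model parameters without re-indexing.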

XPM Task xpmir.index.bow.BOWSparseRetrieverIndexBuilder(*, documents, stemmer, language, stop_words, batch_size, max_docs, in_memory_threshold, compress)

Bases: Task

Submit type: Any

Builds a bag-of-words index from document text

Uses impact_index.BOWIndexBuilder to tokenize documents and store term frequencies + document lengths for BM25 scoring.

Defaults match Lucene/Pyserini’s EnglishAnalyzer pipeline:

  • Porter stemmer (original, not Snowball/Porter2)

  • English stop words (33-word Lucene default)

  • UAX#29 tokenization with English possessive filter

  • Block size 128 for effective block-max pruning
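The analysis steps of such a pipeline can be approximated in a few lines. The sketch below is not the impact_index or Lucene implementation: it replaces UAX#29 segmentation with a simple regular expression and omits the Porter stemming step entirely. The stop-word set is the 33-word default of Lucene's EnglishAnalyzer; the function name is illustrative.

```python
import re
from typing import List

# The 33 default English stop words of Lucene's EnglishAnalyzer
LUCENE_ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

# Rough approximation of word segmentation (not full UAX#29)
_TOKEN = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")
_POSSESSIVE = re.compile(r"['’][sS]$")


def analyze(text: str) -> List[str]:
    """Tokenize, strip English possessives, lowercase, and drop stop words."""
    tokens = []
    for match in _TOKEN.finditer(text):
        token = _POSSESSIVE.sub("", match.group(0)).lower()
        if token and token not in LUCENE_ENGLISH_STOP_WORDS:
            tokens.append(token)
    return tokens  # a real pipeline would apply the Porter stemmer here


terms = analyze("The dog's bone is in the garden")
```

Queries need to go through the same analysis as documents; otherwise query terms would not match the indexed vocabulary.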

documents: datamaestro_ir.data.DocumentStore

Set of documents to index

stemmer: str = porter

Stemmer: ‘porter’ (Lucene-compatible), ‘snowball’ (Porter2), or ‘none’

language: str = english

Language for stemming and stop words

stop_words: bool = True

Whether to filter stop words (uses Lucene defaults for the language)

batch_size: int = 10000

Batch size for parallel text analysis

max_docs: int = 0

Maximum number of indexed documents (0 = all)

in_memory_threshold: int = 128

Block size for posting lists (128 = optimal for block-max pruning)

index_path: path (generated)

Path to store the index

compress: bool = True

Whether to compress the index after building (default: True)

version: int = 3 (constant)

Version 3: Porter stemmer, stop words, batch indexing, compression by default

XPM Task xpmir.index.sparse.Sparse2BMPConverter(*, index, block_size, compress_range)

Bases: Task

Submit type: Any

index: xpmir.index.sparse.SparseRetrieverIndex

The sparse index

bmp_index_path: path (generated)

The final index path

block_size: int

The block size

compress_range: bool

Flag for BMP index compression
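To give an intuition for the block_size parameter, the sketch below computes per-block maximum impacts for one term, which is the kind of summary that block-max pruning relies on to skip blocks that cannot enter the top-k. It assumes blocks are fixed-size ranges of document IDs; the function name and data layout are hypothetical and do not describe the actual BMP index format.

```python
from typing import Dict, Iterable, Tuple


def block_max_impacts(
    postings: Iterable[Tuple[int, float]], block_size: int
) -> Dict[int, float]:
    """Map each block index (doc_id // block_size) to its largest impact."""
    maxima: Dict[int, float] = {}
    for doc_id, impact in postings:
        block = doc_id // block_size
        maxima[block] = max(impact, maxima.get(block, 0.0))
    return maxima


# Posting list of one term as (document ID, impact) pairs
block_maxima = block_max_impacts([(0, 1.5), (3, 0.5), (4, 2.0), (9, 0.7)], block_size=4)
```

Smaller blocks give tighter upper bounds but more summaries to store and scan, so the block size is a trade-off between pruning precision and overhead.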