Learning to Rank

This section covers the learning-to-rank (LTR) components: scorers that assign relevance scores to query-document pairs, samplers that produce training data, and trainers that optimise model parameters. XPMIR supports pointwise, pairwise, batchwise, and distillation training strategies.

Scorers

Scorers assign a relevance score to a (query, document) pair. AbstractModuleScorer is the base class for scorers with learnable parameters (neural models).

XPM Config xpmir.rankers.scorer.Scorer(*, doc, bibtex)

Bases: Config, Initializable, EasyLogger, ABC

Query-document scorer

A model able to give a score to a list of documents given a query

doc: str: Paper description or title (used in HF Hub README)

bibtex: str: BibTeX citation (used in HF Hub README)

abstractmethod compute(topic: IDTextRecord, documents: Iterable[ScoredDocument]) → List[ScoredDocument]

Score all documents with respect to a single topic.

This method should be implemented by subclasses to provide the actual scoring logic. It is query-atomic (processes one query at a time).

getRetriever(retriever: Retriever, batch_size: int, top_k=None, device=None)

Returns a two-stage re-ranker that rescores the output of the given retriever with this scorer

Parameters:

retriever – The first-stage retriever whose results are re-ranked
device – Device for the ranker, or None if no change should be made
batch_size – The number of documents in each batch
top_k – Number of documents to re-rank (or None for all)
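The two-stage idea behind getRetriever can be sketched without the library. The snippet below is a minimal illustration only: rerank, toy_first_stage and toy_scorer are hypothetical names, and documents are represented as plain (text, score) tuples rather than ScoredDocument objects. It keeps the top_k first-stage candidates (or all of them when top_k is None) and rescores them in batches of batch_size.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for ScoredDocument: (document text, score)
ScoredDoc = Tuple[str, float]


def rerank(
    first_stage: Callable[[str], List[ScoredDoc]],
    score_batch: Callable[[str, List[str]], List[float]],
    query: str,
    batch_size: int,
    top_k: Optional[int] = None,
) -> List[ScoredDoc]:
    """Re-rank first-stage candidates with a second-stage scorer."""
    # Order candidates by their first-stage score, best first
    candidates = sorted(first_stage(query), key=lambda d: d[1], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    texts = [text for text, _ in candidates]
    rescored: List[ScoredDoc] = []
    # Score the candidates batch by batch
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        rescored.extend(zip(batch, score_batch(query, batch)))
    return sorted(rescored, key=lambda d: d[1], reverse=True)


def toy_first_stage(query: str) -> List[ScoredDoc]:
    return [("short doc", 3.0), ("a much longer document", 2.0), ("mid doc x", 1.0)]


def toy_scorer(query: str, texts: List[str]) -> List[float]:
    # Toy second stage: longer documents get higher scores
    return [float(len(text)) for text in texts]


result = rerank(toy_first_stage, toy_scorer, "q", batch_size=2, top_k=2)
```

With top_k=2 the third candidate is never seen by the second stage, which is the usual trade-off between re-ranking cost and recall.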

initialize(*args, **kwargs)

Main initialization

Calls __initialize__() at most once: repeated calls to initialize() have no further effect.
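The run-once behaviour can be illustrated with a small stand-alone class. This is a sketch under the assumption that "once" means later calls to initialize() are ignored; OnceInitializable and CountingScorer are hypothetical names, not the library's Initializable implementation.

```python
class OnceInitializable:
    """Hypothetical stand-in for an initialise-once mix-in."""

    def __init__(self):
        self._initialized = False

    def initialize(self, *args, **kwargs):
        # Run the real initialisation only on the first call
        if not self._initialized:
            self._initialized = True
            self.__initialize__(*args, **kwargs)

    def __initialize__(self, *args, **kwargs):
        """Overridden by subclasses (e.g. to load model weights)."""


class CountingScorer(OnceInitializable):
    def __init__(self):
        super().__init__()
        self.calls = 0

    def __initialize__(self, *args, **kwargs):
        self.calls += 1


scorer = CountingScorer()
scorer.initialize()
scorer.initialize()  # ignored: the scorer is already initialised
```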

rsv(topic: str | IDTextRecord, documents: List[ScoredDocument] | ScoredDocument | str | List[str]) → List[ScoredDocument]

Compute the Retrieval Status Value (RSV) for a query and a set of documents.

This method is the primary entry point for scoring a set of documents against a single query. It handles input normalization and delegates to the compute() method.

Note

For large-scale evaluation involving multiple queries, using Retriever.retrieve_all() via a TwoStageRetriever is preferred as it allows for cross-query batching on GPUs.
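The input normalisation performed by rsv() before delegating to compute() might look like the following sketch. It is not the library code: plain dictionaries stand in for IDTextRecord and ScoredDocument, the "text" and "score" keys are assumptions made for illustration, and the scoring function is a toy term-overlap count.

```python
from typing import Dict, List, Union

# Hypothetical stand-in for ScoredDocument
Doc = Dict[str, object]


def normalize_topic(topic: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Wrap a bare query string into a record-like dictionary."""
    return {"text": topic} if isinstance(topic, str) else topic


def normalize_documents(documents) -> List[Doc]:
    """Accept a single document, a string, or a list of either."""
    if not isinstance(documents, list):
        documents = [documents]
    return [
        {"text": d, "score": None} if isinstance(d, str) else d
        for d in documents
    ]


def compute(topic: Dict[str, str], documents: List[Doc]) -> List[Doc]:
    """Toy scoring: number of query terms that occur in the document."""
    terms = set(topic["text"].lower().split())
    return [
        {**d, "score": float(len(terms & set(str(d["text"]).lower().split())))}
        for d in documents
    ]


def rsv(topic, documents) -> List[Doc]:
    """Normalise the inputs, then delegate to compute()."""
    return compute(normalize_topic(topic), normalize_documents(documents))


scored = rsv("neural ranking", ["neural ranking models", "cooking recipes"])
```

The same call accepts a single document or an already-structured record, which is what makes such an entry point convenient for interactive use.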

XPM Config xpmir.rankers.scorer.RandomScorer(*, doc, bibtex, random)

Bases: Scorer

A random scorer

doc: str: Paper description or title (used in HF Hub README)

bibtex: str: BibTeX citation (used in HF Hub README)

random: xpm_torch.base.Random: The random number generator

XPM Config xpmir.rankers.scorer.AbstractModuleScorer(*, doc, bibtex)

Bases: Scorer, Module

Base class for all torch-based Modules implementing the xpmir.rankers.Scorer.

While compute() (inherited from Scorer) processes documents for a single query, AbstractModuleScorer also supports cross-query batching when called directly through its forward method (aliased as __call__).

When used in a TwoStageRetriever with a batchsize > 0, the retriever will use the PointwiseItems batching to maximize GPU utilization across multiple queries.
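Cross-query batching amounts to flattening the (query, document) pairs of several queries into one stream, scoring that stream in fixed-size batches, and regrouping the scores by query. The sketch below shows this with plain Python; score_pairs and score_across_queries are hypothetical helpers, and the term-overlap scorer merely stands in for a neural forward pass.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (query, document)


def score_pairs(pairs: List[Pair]) -> List[float]:
    """Toy batch scorer: number of terms shared by query and document."""
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]


def score_across_queries(
    candidates: Dict[str, List[str]], batch_size: int
) -> Dict[str, List[Tuple[str, float]]]:
    # Flatten: a batch may mix pairs that belong to different queries
    pairs = [(q, d) for q, docs in candidates.items() for d in docs]

    results = defaultdict(list)
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        # Regroup each score under the query it belongs to
        for (q, d), score in zip(batch, score_pairs(batch)):
            results[q].append((d, score))
    return dict(results)


scores = score_across_queries(
    {"red car": ["red car sale", "blue bike"], "blue bike": ["blue bike"]},
    batch_size=2,
)
```

Because batches are filled regardless of query boundaries, the last candidates of one query and the first of the next share a batch instead of leaving it partially empty.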

doc: str: Paper description or title (used in HF Hub README)

bibtex: str: BibTeX citation (used in HF Hub README)

xpmir.rankers.scorer_retriever(documents: Documents, *, retrievers: RetrieverFactory, scorer: Scorer, key: str = None, **kwargs)

Helper function that returns a two-stage retriever. It is useful in combination with partial, when the scorer is not yet known.

Parameters:

documents – The document collection
retrievers – A retriever factory
scorer – The scorer

Returns:

A retriever, obtained by calling scorer.getRetriever()
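The deferred-scorer pattern can be shown with functools.partial from the standard library, assuming that this is the "partial" the description refers to. In this sketch make_two_stage is a hypothetical stand-in that returns a dictionary, not the real scorer_retriever.

```python
from functools import partial


def make_two_stage(documents, *, first_stage, scorer, **kwargs):
    """Hypothetical stand-in for a two-stage retriever factory."""
    return {
        "documents": documents,
        "first_stage": first_stage,
        "scorer": scorer,
        **kwargs,
    }


# Fix everything except the scorer, which is not known yet
factory = partial(make_two_stage, ["d1", "d2"], first_stage="bm25", top_k=10)

# Later, once a scorer is available, complete the call
retriever = factory(scorer="cross-encoder")
```

The partially applied factory can be passed around (for example to a training loop that evaluates several checkpoints) and completed once per scorer.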

Retrievers from scorers

Scorers can be wrapped as retrievers through a TwoStageRetriever (see Retrieval).

Naming conventions

The project uses consistent naming for data objects at different layers:

Records – Low-level data structures (e.g. IDTextRecord, ScoreRecord). Implemented as TypedDict for raw data or identifiers.
Samples – Data-layer objects (e.g. PairwiseSample). Found in datamaestro; represent raw containers, possibly non-hydrated.
Items – Model-ready objects (e.g. PointwiseItem, PairwiseItem). Hydrated objects used in the training loop, ready to be converted into tensors.
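The three layers can be illustrated with a minimal sketch. All class and field names below (TextRecord, PairwiseIdSample, HydratedPairwiseItem, hydrate) are hypothetical and only mirror the convention: a record is a TypedDict, a sample may hold identifiers only, and an item carries the hydrated text.

```python
from dataclasses import dataclass
from typing import Dict, TypedDict


class TextRecord(TypedDict):
    """Record: low-level structure holding an identifier and its text."""
    id: str
    text: str


@dataclass
class PairwiseIdSample:
    """Sample: data-layer container, here non-hydrated (identifiers only)."""
    query_id: str
    positive_id: str
    negative_id: str


@dataclass
class HydratedPairwiseItem:
    """Item: model-ready object with the text filled in."""
    query: TextRecord
    positive: TextRecord
    negative: TextRecord


def hydrate(
    sample: PairwiseIdSample, queries: Dict[str, str], documents: Dict[str, str]
) -> HydratedPairwiseItem:
    """Look up the text behind each identifier."""
    return HydratedPairwiseItem(
        query={"id": sample.query_id, "text": queries[sample.query_id]},
        positive={"id": sample.positive_id, "text": documents[sample.positive_id]},
        negative={"id": sample.negative_id, "text": documents[sample.negative_id]},
    )


item = hydrate(
    PairwiseIdSample("q1", "d1", "d2"),
    queries={"q1": "what is ranking"},
    documents={"d1": "ranking orders documents", "d2": "a cooking recipe"},
)
```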

Samplers

Samplers generate model-ready items from a dataset and a scorer (used for hard-negative mining or scoring).

XPM Config xpmir.letor.samplers.ModelBasedSampler(*, dataset, retriever)

Bases: Sampler

Base class for retriever-based samplers

dataset: datamaestro_ir.data.Adhoc: The IR adhoc dataset

retriever: xpmir.rankers.retriever.Retriever: A retriever to sample negative documents
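One way a retriever can supply negative documents is sketched below. This is an illustration under stated assumptions, not the behaviour of ModelBasedSampler: sample_triples is a hypothetical helper, relevance judgements are given as a dictionary of positive document ids, and any retrieved document that is not judged positive is treated as a negative.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (query, positive id, negative id)


def sample_triples(
    qrels: Dict[str, Set[str]],
    retrieve: Callable[[str], List[str]],
    rng: random.Random,
) -> List[Triple]:
    triples: List[Triple] = []
    for query, positives in qrels.items():
        # Retrieved documents that are not judged relevant serve as negatives
        negatives = [doc for doc in retrieve(query) if doc not in positives]
        if not negatives:
            continue  # nothing to contrast the positives with
        for positive in sorted(positives):
            triples.append((query, positive, rng.choice(negatives)))
    return triples


def toy_retriever(query: str) -> List[str]:
    return ["d1", "d2", "d3", "d4"]


triples = sample_triples(
    {"q1": {"d1"}, "q2": {"d2", "d3"}}, toy_retriever, random.Random(0)
)
```

Negatives drawn from a retriever's own results tend to be harder than uniformly sampled ones, which is the motivation for hard-negative mining.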

Training items

Data classes representing training instances at different granularities.

class xpmir.letor.records.BatchwiseItems(iterable=(), /)

Bases: BaseItems

Several documents (with associated [pseudo]relevance) per query

Assumes that the number of documents per query is always the same (even though documents themselves can be different)
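The fixed-size assumption is what allows a batch to be laid out as a rectangular queries × documents grid. The sketch below checks that assumption and builds such a grid from plain tuples; relevance_grid is a hypothetical helper and does not reflect the internals of BatchwiseItems.

```python
from typing import List, Sequence, Tuple

# (query, [(document, relevance), ...])
QueryDocs = Tuple[str, List[Tuple[str, float]]]


def relevance_grid(batch: Sequence[QueryDocs]) -> List[List[float]]:
    """Return one row of relevance values per query."""
    sizes = {len(docs) for _, docs in batch}
    if len(sizes) > 1:
        raise ValueError(f"Queries have different numbers of documents: {sizes}")
    return [[relevance for _, relevance in docs] for _, docs in batch]


grid = relevance_grid(
    [
        ("q1", [("d1", 1.0), ("d2", 0.0)]),
        ("q2", [("d3", 0.5), ("d4", 0.0)]),
    ]
)
```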

class xpmir.letor.records.ListwiseItem(query: QueryT, documents: List[DocT])

Bases: SampleItem[DocT, QueryT]

A listwise Item is a generic data class composed of a query and a list of documents

class xpmir.letor.records.PairwiseItem(query: QueryT, positive: DocT, negative: DocT)

Bases: SampleItem[DocT, QueryT]

A pairwise record is composed of a query, a positive and a negative document

class xpmir.letor.records.PointwiseItem(topic: QueryT, document: DocT, relevance: float | None = None)

Bases: SampleItem[DocT, QueryT]

An Item from a pointwise sampler

Document samplers

Samplers that produce documents (without queries). Useful for pre-training objectives or for learning index parameters (e.g. FAISS quantisers).

XPM Config xpmir.documents.samplers.DocumentSampler(*, documents)

Bases: Config, ABC

How to sample from a document store

documents: datamaestro_ir.data.DocumentStore

XPM Config xpmir.documents.samplers.HeadDocumentSampler(*, documents, max_count, max_ratio)

Bases: DocumentSampler

A basic sampler that iterates over the first documents

If max_count is 0, it iterates over all documents

documents: datamaestro_ir.data.DocumentStore

max_count: int = 0: Maximum number of documents (if 0, no limit)

max_ratio: float = 0: Maximum ratio of documents (if 0, no limit)
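The effect of the two limits can be sketched with itertools.islice. How the library combines max_count and max_ratio when both are set is not stated here, so the snippet assumes that the stricter limit wins; head_documents is a hypothetical helper working on an in-memory sequence.

```python
from itertools import islice
from typing import Iterator, Sequence


def head_documents(
    documents: Sequence[str], max_count: int = 0, max_ratio: float = 0.0
) -> Iterator[str]:
    """Iterate over the first documents, with 0 meaning 'no limit'."""
    limit = len(documents)
    if max_count > 0:
        limit = min(limit, max_count)
    if max_ratio > 0:
        # Assumption: the ratio is relative to the collection size
        limit = min(limit, int(max_ratio * len(documents)))
    return islice(documents, limit)


collection = [f"doc{i}" for i in range(10)]
first_three = list(head_documents(collection, max_count=3))
first_half = list(head_documents(collection, max_ratio=0.5))
everything = list(head_documents(collection))
```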

XPM Config xpmir.documents.samplers.RandomDocumentSampler(*, documents, max_count, max_ratio, random)

Bases: DocumentSampler

A basic sampler that draws documents at random

Either max_count or max_ratio should be non-zero

documents: datamaestro_ir.data.DocumentStore

max_count: int = 0: Maximum number of documents (if 0, no limit)

max_ratio: float = 0: Maximum ratio of documents (if 0, no limit)

random: xpm_torch.base.Random: Random sampler
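A seeded random subset of a collection can be drawn as in the sketch below. It assumes sampling without replacement and, when both limits are set, that the stricter one applies; neither assumption is confirmed by the reference above. random_documents is a hypothetical helper and uses Python's random.Random instead of xpm_torch.base.Random.

```python
import random
from typing import List, Sequence


def random_documents(
    documents: Sequence[str],
    max_count: int = 0,
    max_ratio: float = 0.0,
    seed: int = 0,
) -> List[str]:
    """Draw a reproducible random subset of the collection."""
    if max_count <= 0 and max_ratio <= 0:
        raise ValueError("Either max_count or max_ratio should be non-zero")

    limit = len(documents)
    if max_count > 0:
        limit = min(limit, max_count)
    if max_ratio > 0:
        limit = min(limit, int(max_ratio * len(documents)))

    # A dedicated generator keeps the draw independent of global state
    rng = random.Random(seed)
    return rng.sample(list(documents), limit)


collection = [f"doc{i}" for i in range(20)]
subset = random_documents(collection, max_count=4, seed=42)
```

Using the same seed yields the same subset, which matters when the sample is used to fit index parameters that must be reproducible across runs.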

Sample adapters

Transforms applied to samples before they reach the model (e.g. hydrating document text from a store, adding query prefixes).

XPM Config xpmir.letor.samplers.hydrators.SampleTransform

Bases: Config, ABC

XPM Config xpmir.letor.samplers.hydrators.SampleHydrator(*, documentstore, querystore)

Bases: SampleTransform

Base class for document/topic hydrators (deprecated: use StoreHydrator + SamplerAdapter)

documentstore: datamaestro_ir.data.DocumentStore: The store for document texts if needed

querystore: xpmir.datasets.adapters.TextStore: The store for query texts if needed

XPM Config xpmir.letor.samplers.hydrators.SamplePrefixAdding(*, query_prefix, document_prefix)

Bases: SampleTransform

Transforms queries and documents by prepending the configured prefixes

query_prefix: str: The prefix for the query

document_prefix: str: The prefix for the document

XPM Config xpmir.letor.samplers.hydrators.SampleTransformList(*, adapters)

Bases: SampleTransform

A class that groups a list of sample transforms

adapters: List[xpmir.letor.samplers.hydrators.SampleTransform]: The list of sample transforms to be applied
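Prefix adding and transform chaining can be sketched with plain functions. The names below (Sample, add_prefixes, chain, lowercase) are hypothetical, and the snippet assumes that a transform list applies its members in list order, which the reference above does not state.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for a (query, document) sample."""
    query: str
    document: str


Transform = Callable[[Sample], Sample]


def add_prefixes(query_prefix: str, document_prefix: str) -> Transform:
    """Build a transform that prepends a prefix to each side."""
    def transform(sample: Sample) -> Sample:
        return replace(
            sample,
            query=query_prefix + sample.query,
            document=document_prefix + sample.document,
        )
    return transform


def lowercase(sample: Sample) -> Sample:
    return replace(
        sample, query=sample.query.lower(), document=sample.document.lower()
    )


def chain(transforms: List[Transform]) -> Transform:
    """Group several transforms into one, applied in list order."""
    def transform(sample: Sample) -> Sample:
        for step in transforms:
            sample = step(sample)
        return sample
    return transform


pipeline = chain([lowercase, add_prefixes("query: ", "passage: ")])
out = pipeline(Sample("What is LTR?", "Learning To Rank"))
```

Order matters here: placing lowercase after add_prefixes would also lowercase the prefixes themselves.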