Retrieval

This page describes the components used for retrieval, i.e. finding the top \(k\) documents from a collection given a query. XPMIR supports classical models (BM25, query-likelihood), dense retrieval (FAISS), sparse learned retrieval, and late-interaction models (ColBERT/PLAID), as well as multi-stage pipelines that combine them.

Base classes

Core data structures and the abstract retriever interface that all implementations extend.

class xpmir.rankers.ScoredDocument(document: dict, score: float)

Bases: object

A data structure that associates a score with a document, so that documents can be sorted by score (e.g., to compute nDCG)
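Sorting scored documents by decreasing score can be sketched as follows. The ScoredDoc dataclass is a hypothetical stand-in written for this example; it is not the actual xpmir.rankers.ScoredDocument class.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class ScoredDoc:
    """Hypothetical stand-in for a (document, score) pair."""

    # Only the score takes part in comparisons
    score: float
    document: dict = field(compare=False)


docs = [
    ScoredDoc(1.2, {"id": "d1"}),
    ScoredDoc(3.4, {"id": "d2"}),
    ScoredDoc(0.5, {"id": "d3"}),
]

# Highest score first, as expected by ranking metrics such as nDCG
ranking = sorted(docs, reverse=True)
ranked_ids = [d.document["id"] for d in ranking]
```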

XPM Config xpmir.rankers.retriever.Retriever(*, store)

Bases: Config, ModuleContainer, ABC

A retriever is a model to return top-scored documents given a query

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

collection()

Returns the document collection object

abstractmethod retrieve(record: IDTextRecord) → List[ScoredDocument]

Retrieves documents, returning a list sorted by decreasing score

If content is true, includes the full text of the document

retrieve_all(queries: Dict[str, IDTextRecord]) → Dict[str, List[ScoredDocument]]

Retrieves documents for a set of queries

By default, iterates over the queries and calls self.retrieve for each one; subclasses can override this method to process queries more efficiently (e.g. in batches)

Parameters:

queries – A dictionary where the key is the ID of the query, and the value is the query record (IDTextRecord)
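The default behaviour of retrieve_all() (one call to retrieve() per query) can be sketched as follows. ToyRetriever, its term-overlap scoring, and the use of plain strings for queries are assumptions made for this example; they are not part of XPMIR.

```python
from typing import Dict, List, Tuple

# Hypothetical toy types: a query is plain text, a result is (doc_id, score)
Result = Tuple[str, float]


class ToyRetriever:
    """Illustrative retriever over an in-memory collection (not the XPMIR class)."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs

    def retrieve(self, query: str) -> List[Result]:
        # Score = number of query terms present in the document
        terms = set(query.lower().split())
        scored = [
            (doc_id, float(len(terms & set(text.lower().split()))))
            for doc_id, text in self.docs.items()
        ]
        # Sorted by decreasing score
        return sorted(scored, key=lambda r: r[1], reverse=True)

    def retrieve_all(self, queries: Dict[str, str]) -> Dict[str, List[Result]]:
        # Default strategy: one call to retrieve() per query
        return {qid: self.retrieve(q) for qid, q in queries.items()}


retriever = ToyRetriever({"d1": "neural ranking models", "d2": "cooking pasta"})
runs = retriever.retrieve_all({"q1": "neural ranking", "q2": "pasta"})
```

A subclass with a batched backend would typically override retrieve_all() and keep retrieve() as the single-query entry point.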

Standard IR models

Definitions for classical probabilistic retrieval models. These are backend-agnostic specifications that can be instantiated with a concrete engine such as AnseriniRetriever.

XPM Config xpmir.rankers.standard.Model

Bases: Config

Base class for standard IR models

XPM Config xpmir.rankers.standard.BM25(*, k1, b)

Bases: Model

BM-25 model definition

k1: float = 0.9
b: float = 0.4
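To illustrate the role of k1 and b, the sketch below computes the contribution of a single query term using a common textbook form of BM25. The function name, its arguments, and the IDF variant are assumptions for this example; the exact formula applied by a given backend may differ.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Contribution of one query term to the BM25 score of one document."""
    # Inverse document frequency (variant with +1 inside the log, always positive)
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: b = 0 ignores document length, b = 1 normalises fully
    norm = 1.0 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation controlled by k1
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


# A term occurring 3 times in an average-length document,
# present in 10 of the 1000 documents of the collection
score = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
```

Increasing tf raises the score with diminishing returns, while a document longer than average is penalised in proportion to b.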

XPM Config xpmir.rankers.standard.QLDirichlet(*, mu)

Bases: Model

Query likelihood (Dirichlet smoothing) model definition

mu: float = 1000
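The effect of mu can be seen in a minimal sketch of query-likelihood scoring with Dirichlet smoothing. The function and the toy statistics below are assumptions for this example, not XPMIR code.

```python
import math


def ql_dirichlet_score(query_terms, doc_tf, doc_len, collection_prob, mu=1000.0):
    """Log-likelihood of the query under a Dirichlet-smoothed document model."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        # Smoothed probability: interpolates document and collection statistics
        p = (tf + mu * collection_prob[term]) / (doc_len + mu)
        score += math.log(p)
    return score


collection_prob = {"neural": 0.001, "ranking": 0.002}
doc_a = {"neural": 2, "ranking": 1}  # term frequencies, document length 100
doc_b = {"ranking": 1}               # same length, misses "neural"

score_a = ql_dirichlet_score(["neural", "ranking"], doc_a, 100, collection_prob)
score_b = ql_dirichlet_score(["neural", "ranking"], doc_b, 100, collection_prob)
```

Thanks to smoothing, a document that misses a query term still receives a finite (but lower) score.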

Multi-stage retrievers

In a re-ranking setting, a two-stage retriever first retrieves candidates with a fast first-stage model, then re-scores them with a more expensive scorer.

The re-ranking process is memory-efficient: it uses lazy evaluation of first-stage results and maximises GPU throughput by batching query-document pairs across multiple queries.
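The combination of lazy first-stage evaluation and cross-query batching can be sketched as follows. All names (candidate_pairs, rerank, the toy first stage and scorer) are hypothetical and only illustrate the idea; the actual implementation may be organised differently.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str, str]  # (query id, query text, document id)


def candidate_pairs(
    queries: Dict[str, str], first_stage: Callable[[str], Iterable[str]]
) -> Iterator[Pair]:
    # Generator: first-stage results are produced only when a batch needs them
    for qid, text in queries.items():
        for doc_id in first_stage(text):
            yield qid, text, doc_id


def rerank(queries, first_stage, score_batch, batch_size, top_k):
    results: Dict[str, List[Tuple[str, float]]] = {qid: [] for qid in queries}
    pairs = candidate_pairs(queries, first_stage)
    # A batch may mix pairs from several queries, which keeps every batch full
    while batch := list(islice(pairs, batch_size)):
        scores = score_batch([(text, doc_id) for _, text, doc_id in batch])
        for (qid, _, doc_id), score in zip(batch, scores):
            results[qid].append((doc_id, score))
    return {
        qid: sorted(docs, key=lambda d: d[1], reverse=True)[:top_k]
        for qid, docs in results.items()
    }


# Toy components (hypothetical): fixed candidates and a length-based "scorer"
docs = {"d1": "short", "d2": "a much longer document", "d3": "medium text"}


def first_stage(query: str) -> List[str]:
    return ["d1", "d2", "d3"]


def score_batch(pairs: List[Tuple[str, str]]) -> List[float]:
    return [float(len(docs[doc_id])) for _, doc_id in pairs]


ranked = rerank({"q1": "x", "q2": "y"}, first_stage, score_batch, batch_size=4, top_k=2)
```

With three candidates per query and a batch size of 4, the first batch contains all pairs of q1 plus one pair of q2, so no batch is left partially empty until the very end.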

XPM Config xpmir.rankers.scorer.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)

Bases: Retriever

Abstract class for all two stage retrievers (i.e. scorers and duo-scorers)

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever

The base retriever

scorer: xpmir.rankers.scorer.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker

XPM Config xpmir.rankers.scorer.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)

Bases: AbstractTwoStageRetriever

Uses one retriever to select the top-K documents, which are then re-ranked by a scorer.

Multi-GPU support:

When set up with a lightning.Fabric instance, retrieve_all() shards the re-ranking task across GPUs and gathers the results. It uses efficient cross-query batching to maximize GPU throughput.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever

The base retriever

scorer: xpmir.rankers.scorer.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker

Duo-retrievers

Duo-retrievers predict which of two candidate documents is more relevant to the query (pairwise preference), rather than assigning an absolute score.

XPM Config xpmir.rankers.scorer.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)

Bases: AbstractTwoStageRetriever

The two stage retriever for pairwise scorers.

With a pairwise scorer, the pairwise preference scores have to be aggregated into a single score per document.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever

The base retriever

scorer: xpmir.rankers.scorer.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker
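One simple way to turn pairwise preferences into a ranking is to sum, for each document, its preference scores against all other candidates. The sketch below shows this strategy; it is an illustrative assumption and not necessarily the aggregation used by DuoTwoStageRetriever. The prefer function and the grades are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple


def aggregate_pairwise(
    doc_ids: List[str], prefer: Callable[[str, str], float]
) -> List[Tuple[str, float]]:
    """Rank documents by the sum of their pairwise preference scores.

    prefer(a, b) is assumed to return P(a is more relevant than b) for a
    fixed query.
    """
    totals: Dict[str, float] = {doc_id: 0.0 for doc_id in doc_ids}
    # Every ordered pair is scored, so each document is compared to all others
    for a, b in permutations(doc_ids, 2):
        totals[a] += prefer(a, b)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy preference model (hypothetical): derived from hidden relevance grades
grades = {"d1": 0.0, "d2": 2.0, "d3": 1.0}


def prefer(a: str, b: str) -> float:
    return 1.0 if grades[a] > grades[b] else 0.0


ranking = aggregate_pairwise(["d1", "d2", "d3"], prefer)
```

Since the number of ordered pairs grows quadratically with the number of candidates, this kind of re-ranking is usually applied to a small top_k.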

XPM Config xpmir.rankers.scorer.DuoLearnableScorer(*, doc, bibtex)

Bases: AbstractModuleScorer

Base class for models that can score a triplet (query, document 1, document 2)

doc: str

Paper description or title (used in HF Hub README)

bibtex: str

BibTeX citation (used in HF Hub README)

Miscellaneous retrievers

Utility retrievers for loading pre-computed runs, hydrating results with document text, or exhaustive scoring.

XPM Config xpmir.rankers.full.FullRetriever(*, store, documents)

Bases: Retriever

Retrieves all the documents of the collection

This can be used to build a small validation set on a subset of the collection - in that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

documents: datamaestro_ir.data.Documents

XPM Config xpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher)

Bases: Retriever

Scores all the documents from a collection

Encodes all queries at once, then processes documents in batches, scoring the full query×document matrix each batch. This is more efficient than the TwoStageRetriever approach for small collections.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

documents: datamaestro_ir.data.Documents

The set of documents to consider

scorer: xpmir.neural.DualRepresentationScorer

The scorer (a dual representation scorer)

batchsize: int = 0
batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher()
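The exhaustive scoring strategy of FullRetrieverRescorer (queries encoded once, documents processed in batches, a full query×document block scored per batch) can be sketched as follows. The example uses plain Python lists and an inner product as the scoring function; all names are hypothetical and the actual implementation may differ.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def full_rescoring(
    query_vecs: Dict[str, Vector],
    doc_vecs: List[Tuple[str, Vector]],
    batch_size: int,
    top_k: int,
) -> Dict[str, List[Tuple[float, str]]]:
    # Queries are encoded once (here: already given as vectors)
    heaps: Dict[str, List[Tuple[float, str]]] = {qid: [] for qid in query_vecs}
    for start in range(0, len(doc_vecs), batch_size):
        batch = doc_vecs[start : start + batch_size]
        # Score the full query x document block for this batch
        for qid, q in query_vecs.items():
            for doc_id, d in batch:
                heapq.heappush(heaps[qid], (dot(q, d), doc_id))
                if len(heaps[qid]) > top_k:
                    heapq.heappop(heaps[qid])  # drop the current lowest score
    return {qid: sorted(h, reverse=True) for qid, h in heaps.items()}


queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
documents = [("d1", [0.9, 0.1]), ("d2", [0.2, 0.8]), ("d3", [0.5, 0.5])]
top = full_rescoring(queries, documents, batch_size=2, top_k=2)
```

Keeping a bounded heap per query means memory stays proportional to the number of queries times top_k, whatever the collection size.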

XPM Config xpmir.rankers.retriever.RetrieverHydrator(*, store, retriever)

Bases: Retriever

Hydrate retrieved results with document text

store: datamaestro_ir.data.DocumentStore

The store for document texts

retriever: xpmir.rankers.retriever.Retriever

The retriever to hydrate

XPM Config xpmir.rankers.retriever.RunRetriever(*, store, run, documents)

Bases: Retriever

A retriever that returns documents from a pre-computed run. This can be useful to build a two-stage retriever with a pre-computed first stage (e.g. for validation when training a scorer model).

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

run: datamaestro_ir.data.AdhocRun

The pre-computed run

documents: datamaestro_ir.data.Documents

Associated documents

Distributed Retrieval

XPMIR supports distributed retrieval and re-ranking across multiple GPUs using Lightning Fabric. Currently, this optimized distributed logic is implemented for xpmir.index.sparse.SparseRetriever and xpmir.rankers.scorer.TwoStageRetriever.

This is particularly useful for large-scale evaluation on thousands of queries.

How it works

When a retriever is configured with a Fabric instance, the retrieve_all() method leverages all available devices:

  1. Query Sharding: The set of queries is automatically partitioned across the available GPUs.

  2. Parallel Processing: Each device processes its assigned shard.

    • For Sparse Retrieval (SparseRetriever), queries are encoded in batches and searched via asynchronous workers.

    • For Two-Stage Retrieval (TwoStageRetriever), document re-ranking is batched across queries for maximum throughput.

  3. Result Gathering: Once processing is complete, results are collected from all ranks and merged on the global zero rank.
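The three steps above can be simulated in a single process as follows. The round-robin partitioning and the function names are assumptions for this sketch; the actual sharding scheme and the Fabric-based gathering may differ.

```python
from typing import Callable, Dict, List


def shard_queries(queries: Dict[str, str], rank: int, world_size: int) -> Dict[str, str]:
    # Round-robin assignment: query i goes to rank i % world_size
    return {
        qid: text
        for i, (qid, text) in enumerate(sorted(queries.items()))
        if i % world_size == rank
    }


def simulated_retrieve_all(
    queries: Dict[str, str], retrieve: Callable[[str], List[str]], world_size: int
) -> Dict[str, List[str]]:
    # Each "rank" processes its own shard (sequentially here, in parallel in practice)
    per_rank = [
        {qid: retrieve(text) for qid, text in shard_queries(queries, rank, world_size).items()}
        for rank in range(world_size)
    ]
    # Gathering: merge the per-rank dictionaries on a single process
    merged: Dict[str, List[str]] = {}
    for partial in per_rank:
        merged.update(partial)
    return merged


queries = {f"q{i}": f"query {i}" for i in range(5)}
results = simulated_retrieve_all(queries, lambda text: [text.upper()], world_size=2)
```

Because the shards are disjoint, merging is a plain dictionary union and every query appears exactly once in the final result.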

Usage

Distributed retrieval is automatically enabled when using the Evaluate task with a multi-GPU FabricConfiguration.

To use it manually in a script:

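from lightning import Fabric

# Launch one process per device (here, two GPUs with DDP)
fabric = Fabric(devices=2, strategy="ddp")
fabric.launch()

# `retriever` is an already-configured retriever, e.g. a SparseRetriever
# or a TwoStageRetriever
retriever.initialize()
retriever.setup_with_fabric(fabric)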

# Distributed retrieval
results = retriever.retrieve_all(queries)

# Results are gathered on rank 0
if fabric.is_global_zero:
    print(f"Total queries retrieved: {len(results)}")

Index backends

The sections below describe the available index backends and their associated retrievers.

Anserini

Anserini provides classical inverted-index retrieval (BM25, query-likelihood, etc.) via Lucene.

XPM Config xpmir.index.anserini.Index(*, id, count, file_access, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)

Bases: AdhocIndex

Anserini-backed index

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path

Path to the index

storePositions: bool = False

Store term positions

storeDocvectors: bool = False

Store document term vectors

storeRaw: bool = False

Store raw document

storeContents: bool = False

Store processed documents (e.g. without HTML tags)

stemmer: str = porter

The stemmer to use

XPM Config xpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)

Bases: Retriever

An Anserini-based retriever

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

index: xpmir.index.anserini.Index

The Anserini index

model: xpmir.rankers.standard.Model

The model used for search. Only BM25 is supported so far.

k: int = 1500

Number of results to retrieve

XPM Task xpmir.interfaces.anserini.IndexCollection(*, id, count, file_access, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, documents, threads)

Bases: Index, Task

An Anserini (https://github.com/castorini/anserini) index

id: str

Use an empty ID since identifier is determined by documents

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path (generated)
storePositions: bool = False

Store term positions

storeDocvectors: bool = False

Store document term vectors

storeRaw: bool = False

Store raw document

storeContents: bool = False

Store processed documents (e.g. without HTML tags)

stemmer: str = porter

The stemmer to use

documents: datamaestro_ir.data.Documents

The documents to index

threads: int = 8

Number of threads when indexing

XPM Task xpmir.interfaces.anserini.SearchCollection(*, index, topics, model)

Bases: Task

index: xpmir.index.anserini.Index
topics: datamaestro_ir.data.Topics
model: xpmir.rankers.standard.Model
path: path (generated)

FAISS

FAISS provides approximate nearest-neighbour search for dense vector retrieval.

XPM Config xpmir.index.faiss.FaissIndex(*, normalize, documents)

Bases: Config

FAISS Index

normalize: bool

Whether vectors should be normalized (L2)

faiss_index: path (generated)

Path to the file containing the index

documents: datamaestro_ir.data.DocumentStore

The set of documents
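The effect of the normalize option can be illustrated with a minimal sketch: after L2 normalisation, the inner product between two vectors equals their cosine similarity. The helper functions below are written for this example only and do not use the FAISS API.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    # Leave the zero vector untouched to avoid dividing by zero
    return v if norm == 0.0 else [x / norm for x in v]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


query = [3.0, 4.0]
doc = [6.0, 8.0]  # same direction as the query, twice the length

# Depends on the vector lengths
raw_score = dot(query, doc)
# Depends on the direction only
cosine_score = dot(l2_normalize(query), l2_normalize(doc))
```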

XPM Task xpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, batcher, hooks, indexspec, sampler)

Bases: FaissIndex, Task

Constructs a FAISS index backed up by an index

During execution, InitializationHooks are used (pre/post)

normalize: bool

Whether vectors should be normalized (L2)

faiss_index: path (generated)

Path to the file containing the index

documents: datamaestro_ir.data.DocumentStore

The set of documents

encoder: xpmir.text.encoders.TextEncoder

Encoder for document texts

batchsize: int = 1

The batch size used when computing representations of documents

batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher()

The way to prepare batches of documents

hooks: List[xpmir.context.Hook] = []

An optional list of hooks

indexspec: str

The index type as a factory string. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices and https://github.com/facebookresearch/faiss/wiki/The-index-factory for how to combine them with the index factory

sampler: xpmir.documents.samplers.DocumentSampler

Optional document sampler when training the index – by default, all the documents from the collection are used

XPM Config xpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)

Bases: Retriever

Retriever based on Faiss

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

encoder: xpmir.text.encoders.TextEncoder

The query encoder

index: xpmir.index.faiss.FaissIndex

The faiss index

topk: int

The number of documents to retrieve

fast-plaid (ColBERT / PLAID)

Interface to fast-plaid, a Rust-based implementation of PLAID / ColBERT late-interaction retrieval. Per-document token vectors can be reconstructed from the compressed index via get_document_tokens().
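For readers unfamiliar with late interaction, the ColBERT-style MaxSim score can be sketched as follows: each query token is matched with its most similar document token, and the maxima are summed. The code works on plain Python lists for illustration and does not use fast-plaid.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: each query token keeps its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # matches both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # matches only the first one

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

PLAID avoids computing this score for every document by first selecting candidates through centroids, then scoring only those candidates in full.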

XPM Config xpmir.index.plaid.PlaidIndex(*, documents, compress_only, index_path)

Bases: Config

A ColBERT / PLAID index backed by fast-plaid.

The index stores per-token document embeddings in fast-plaid’s compressed centroid + residual format. Per-document token vectors can be reconstructed (approximately) via get_document_tokens(), which delegates to fast-plaid’s get_embeddings method. The reconstruction quality is controlled by n_bits.

When compress_only is True the index only contains the compressed vectors (centroids + quantised residuals) without the IVF search structure. This is cheaper to build and sufficient when only get_document_tokens() is needed. Attempting to search a compress-only index via PlaidRetriever will raise an error.

documents: datamaestro_ir.data.DocumentStore

Set of documents to index.

compress_only: bool = False
index_path: path

Directory containing the fast-plaid index and side-car files.

get_document_tokens(docid: int | str, device: str = '') → Tensor

Return the (approximate) per-token embeddings for a document.

The vectors are reconstructed from fast-plaid’s compressed centroid + residual storage using FastPlaid.get_embeddings. The reconstruction quality depends on n_bits.

Parameters:
  • docid – The document identifier. Integers are interpreted as internal positions in the index (0..num_docs-1); strings are looked up in the external-to-internal map written at indexing time.

  • device – Device for the fast-plaid instance used to decompress ("" = auto).

Returns:

A (num_tokens, dim) float tensor containing the reconstructed token embeddings.

Raises:

KeyError – if the external identifier is unknown.
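Why the reconstruction is only approximate, and how n_bits affects it, can be illustrated with a toy centroid + residual codec. The uniform scalar quantiser, the fixed scale, and all function names below are assumptions for this sketch; fast-plaid's actual codec may differ.

```python
from typing import List, Tuple

Vector = List[float]


def compress(v: Vector, centroids: List[Vector], n_bits: int, scale: float) -> Tuple[int, List[int]]:
    # Nearest centroid by squared Euclidean distance
    cid = min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
    )
    levels = 2 ** n_bits
    codes = []
    for a, c in zip(v, centroids[cid]):
        # Map the residual from [-scale, scale] to an integer code in [0, levels - 1]
        t = (a - c + scale) / (2 * scale)
        codes.append(min(levels - 1, max(0, int(t * levels))))
    return cid, codes


def reconstruct(cid: int, codes: List[int], centroids: List[Vector], n_bits: int, scale: float) -> Vector:
    levels = 2 ** n_bits
    # Each code is decoded to the centre of its quantisation bucket
    return [
        c + ((code + 0.5) / levels) * 2 * scale - scale
        for c, code in zip(centroids[cid], codes)
    ]


def max_error(v: Vector, n_bits: int) -> float:
    centroids = [[0.0, 0.0], [1.0, 1.0]]
    cid, codes = compress(v, centroids, n_bits, scale=0.5)
    approx = reconstruct(cid, codes, centroids, n_bits, scale=0.5)
    return max(abs(a - b) for a, b in zip(v, approx))


vector = [0.93, 1.21]
error_2_bits = max_error(vector, n_bits=2)
error_4_bits = max_error(vector, n_bits=4)
```

In this toy codec, each additional bit halves the bucket width, so the worst-case reconstruction error per dimension is halved as well.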

XPM Task xpmir.index.plaid.PlaidIndexBuilder(*, documents, encoder, batch_size, n_bits, kmeans_niters, n_samples_kmeans, compress_only, low_memory, device)

Bases: Task

Submit type: xpmir.index.plaid.PlaidIndex

Builds a fast-plaid index from a document collection.

The builder encodes every document using the given ColBERTEncoder, collects the valid (i.e. non-padding) token vectors, and feeds them to fast-plaid.

The fast-plaid index stores the embeddings in a compressed centroid + residual format, so no separate raw-token file is needed. Per-document token vectors can be reconstructed later via PlaidIndex.get_document_tokens().

documents: datamaestro_ir.data.DocumentStore

Set of documents to index.

encoder: xpmir.text.encoders.TextEncoderBase

The ColBERT-style encoder used to produce per-token embeddings.

batch_size: int = 32

Encoder batch size.

n_bits: int = 2

Number of bits used by fast-plaid for residual quantisation.

kmeans_niters: int = 4

Number of K-means iterations performed by fast-plaid when clustering the centroids.

n_samples_kmeans: int = 0

Number of token samples used to train the centroids (0 = fast-plaid default).

compress_only: bool = False

When True, skip IVF construction. The resulting index supports PlaidIndex.get_document_tokens() but not search via PlaidRetriever. Requires fast-plaid support for compress_only (see lightonai/fast-plaid#41). Falls back to building the full index with a warning if unsupported.

low_memory: bool = True

If the index fits in VRAM, set to False for faster search; otherwise, keep True to avoid out-of-memory errors. See https://github.com/lightonai/fast-plaid#-search-speed-tip-low_memoryfalse

device: str

Device for fast-plaid ("" = auto: cuda if available, cpu otherwise).

fabric_config: xpm_torch.configuration.FabricConfiguration (generated)

Control the device for the model encoding (separate from device which controls the fast-plaid side).

index_path: path (generated)

Output directory for the index and its side-car files.

XPM Config xpmir.index.plaid.PlaidRetriever(*, store, encoder, index, topk, n_ivf_probe, n_full_scores, device)

Bases: Retriever

Retriever using a fast-plaid PLAID index.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

encoder: xpmir.rankers.scorer.AbstractModuleScorer

The query encoder. Typically the same encoder that was used to build the index.

index: xpmir.index.plaid.PlaidIndex

The fast-plaid index to search.

topk: int

Number of documents to return per query.

n_ivf_probe: int = 8

Number of inverted-list clusters explored by fast-plaid at search time.

n_full_scores: int = 0

Number of candidates for which fast-plaid computes full scores (0 = fast-plaid default).

device: str

Device for fast-plaid ("" = auto).

Sparse retrieval

Learned sparse retrieval indexes (e.g. for SPLADE), backed by the impact-index Rust library.

XPM Config xpmir.index.sparse.AbstractSparseRetrieverIndex(*, documents)

Bases: Config, ABC

documents: datamaestro_ir.data.DocumentStore

The indexed document collection

XPM Task xpmir.index.sparse.AbstractSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs)

Bases: Task, ABC, Generic[InputType]

Builds an index from a sparse representation

Assumes that documents and queries have the same dimension, and that the score is computed through an inner product

documents: datamaestro_ir.data.DocumentStore

Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase

The encoder

batcher: xpm_torch.batchers.Batcher (generated)

Batcher used when computing representations

batch_size: int

Size of batches

ordered_index: bool

Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant)

Version 3 of the index

max_docs: int = 0

Maximum number of indexed documents
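The inner-product assumption stated above means that a document's score only depends on the tokens it shares with the query, which is what makes an inverted index applicable. The sketch below uses simple term-at-a-time accumulation over in-memory posting lists; it is an illustration under that assumption and does not reflect the strategies (WAND or others) of the impact-index library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # token id -> weight (non-zero entries only)


def build_index(docs: Dict[str, SparseVector]) -> Dict[int, List[Tuple[str, float]]]:
    # Posting lists: for each token, the documents containing it and their weights
    index: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, vector in docs.items():
        for token, weight in vector.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query: SparseVector, topk: int) -> List[Tuple[str, float]]:
    scores: Dict[str, float] = defaultdict(float)
    # Only the posting lists of the query tokens are visited
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topk]


docs = {"d1": {1: 0.5, 7: 1.5}, "d2": {7: 0.2, 9: 2.0}, "d3": {3: 1.0}}
index = build_index(docs)
hits = search(index, {7: 2.0, 9: 0.5}, topk=2)
```

Document d3 shares no token with the query, so it is never touched during the search.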

XPM Config xpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, batchsize, in_memory)

Bases: Retriever, Generic[InputType]

Retriever for learned sparse models (e.g. SPLADE).

This retriever uses a TextEncoderBase to encode queries into sparse vectors, which are then used to search an AbstractSparseRetrieverIndex.

Multi-GPU support:

When set up with a lightning.Fabric instance, retrieve_all() automatically shards the queries across GPUs and merges the results. It also adjusts the number of asynchronous search workers to prevent CPU oversubscription.

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

index: xpmir.index.sparse.AbstractSparseRetrieverIndex

The sparse retriever index

encoder: xpmir.text.encoders.TextEncoderBase

Encodes InputType records to text representation output

topk: int

Number of documents to return

batcher: xpm_torch.batchers.Batcher (generated)

The way to prepare batches of queries (when using retrieve_all)

batchsize: int

Size of batches (when using retrieve_all)

in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

Impact library (Rust)

XPM Config xpmir.index.sparse.SparseRetrieverIndex(*, documents, index_path)

Bases: AbstractSparseRetrieverIndex

documents: datamaestro_ir.data.DocumentStore

The indexed document collection

index_path: path

XPM Task xpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings)

Bases: AbstractSparseRetrieverIndexBuilder[InputType]

Submit type: Any

documents: datamaestro_ir.data.DocumentStore

Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase

The encoder

batcher: xpm_torch.batchers.Batcher (generated)

Batcher used when computing representations

batch_size: int

Size of batches

ordered_index: bool

Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant)

Version 3 of the index

max_docs: int = 0

Maximum number of indexed documents

in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

index_path: path (generated)
checkpoint_frequency: int = 0

Checkpoint frequency (allows recovery at the cost of writing some information to disk)

max_postings: int

Number of postings before dumping a term's postings to disk

fabric_config: xpm_torch.configuration.FabricConfiguration (generated)

Runtime configuration, managed by Fabric

Block-Max Pruning

Adapters for Faster Learned Sparse Retrieval with Block-Max Pruning.

XPM Config xpmir.index.sparse.BMPSparseRetrieverIndex(*, documents, index_path)

Bases: AbstractSparseRetrieverIndex

documents: datamaestro_ir.data.DocumentStore

The indexed document collection

index_path: path

The path of the BMP index

XPM Task xpmir.index.sparse.BMPSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings, block_size, compress_range)

Bases: SparseRetrieverIndexBuilder[InputType]

Submit type: Any

Index using a BMP index

documents: datamaestro_ir.data.DocumentStore

Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase

The encoder

batcher: xpm_torch.batchers.Batcher (generated)

Batcher used when computing representations

batch_size: int

Size of batches

ordered_index: bool

Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant)

Version 3 of the index

max_docs: int = 0

Maximum number of indexed documents

in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

index_path: path (generated)
checkpoint_frequency: int = 0

Checkpoint frequency (allows recovery at the cost of writing some information to disk)

max_postings: int

Number of postings before dumping a term's postings to disk

fabric_config: xpm_torch.configuration.FabricConfiguration (generated)

Runtime configuration, managed by Fabric

block_size: int

The block size

compress_range: bool

Compress the BMP index

bmp_index_path: path (generated)

The final index path

XPM Config xpmir.index.sparse.BMPSparseRetriever(*, store, index, encoder, topk, batchsize, in_memory, alpha, beta)

Bases: SparseRetriever

A Block-Max Pruning retriever

store: datamaestro_ir.data.DocumentStore

Give the document store associated with this retriever

index: xpmir.index.sparse.AbstractSparseRetrieverIndex

The sparse retriever index

encoder: xpmir.text.encoders.TextEncoderBase

Encodes InputType records to text representation output

topk: int

Number of documents to return

batcher: xpm_torch.batchers.Batchergenerated

The way to prepare batches of queries (when using retrieve_all)

batchsize: int

Size of batches (when using retrieve_all)

in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

alpha: float

Granularity of approximation (0 to 1, 1 = no approximation)

beta: float

Percentage of query tokens to keep (0 to 1, 1 = no pruning)
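As an illustration of the beta parameter, query pruning can be sketched as keeping only a fraction of the query tokens. The sketch assumes that the highest-weighted tokens are the ones kept and that the count is rounded up; both choices, as well as the function name, are assumptions and may not match the behaviour of the BMP library.

```python
import math
from typing import Dict


def prune_query(query: Dict[int, float], beta: float) -> Dict[int, float]:
    """Keep the ceil(beta * n) highest-weighted query tokens."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    keep = math.ceil(beta * len(query))
    ranked = sorted(query.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])


query = {11: 0.1, 42: 2.0, 7: 0.9, 3: 0.4}  # token id -> weight
half = prune_query(query, beta=0.5)
full = prune_query(query, beta=1.0)
```

Low-weight tokens contribute little to the inner product, so dropping them trades a small loss in accuracy for fewer posting lists to traverse.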