Retrieval

This page describes the components used for retrieval, i.e. finding the top \(k\) documents from a collection given a query. XPMIR supports classical models (BM25, query-likelihood), dense retrieval (FAISS), sparse learned retrieval, and late-interaction models (ColBERT/PLAID), as well as multi-stage pipelines that combine them.

Base classes 

Core data structures and the abstract retriever interface that all implementations extend.

class xpmir.rankers.ScoredDocument(document: dict, score: float)[source]

Bases: object

A data structure that associates a score with a document, so that documents can be sorted by score (e.g., for nDCG)

XPM Config xpmir.rankers.retriever.Retriever(*, store)[source]

Bases: Config, ModuleContainer, ABC

A retriever is a model that returns the top-scored documents for a given query

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

collection()[source]: Returns the document collection object

abstractmethod retrieve(record: IDTextRecord) → List[ScoredDocument][source]

Retrieves documents, returning a list sorted by decreasing score

if content is true, includes the document full text

retrieve_all(queries: Dict[str, IDTextRecord]) → Dict[str, List[ScoredDocument]][source]

Retrieves documents for a set of queries

By default, iterates over the queries and calls self.retrieve for each one; subclasses can override this method to process queries more efficiently

Parameters: queries – A dictionary where the key is the ID of the query, and the value is the query record (text)
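The default behaviour of retrieve_all() can be illustrated with simplified stand-ins for the classes above. This is only a sketch: the classes below are not the XPMIR ones, queries are plain strings rather than IDTextRecord objects, and ToyRetriever with its term-overlap scoring is a hypothetical example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDocument:
    """Simplified stand-in: a document paired with its score."""

    document: dict
    score: float


class Retriever(ABC):
    """Simplified stand-in for the abstract retriever interface."""

    @abstractmethod
    def retrieve(self, query: str) -> List[ScoredDocument]:
        """Returns documents sorted by decreasing score."""

    def retrieve_all(self, queries: Dict[str, str]) -> Dict[str, List[ScoredDocument]]:
        # Default behaviour: one call to retrieve() per query
        return {qid: self.retrieve(query) for qid, query in queries.items()}


class ToyRetriever(Retriever):
    """Hypothetical retriever scoring documents by term overlap with the query."""

    def __init__(self, documents: List[dict]):
        self.documents = documents

    def retrieve(self, query: str) -> List[ScoredDocument]:
        terms = set(query.lower().split())
        scored = [
            ScoredDocument(doc, float(len(terms & set(doc["text"].lower().split()))))
            for doc in self.documents
        ]
        return sorted(scored, key=lambda sd: sd.score, reverse=True)


docs = [
    {"id": "d1", "text": "neural ranking models"},
    {"id": "d2", "text": "classical ranking with BM25"},
]
results = ToyRetriever(docs).retrieve_all({"q1": "BM25 ranking"})
```

A retriever that can batch its queries would override retrieve_all() instead of relying on this loop.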

Standard IR models 

Definitions for classical probabilistic retrieval models. These are backend-agnostic specifications that can be instantiated with a concrete engine such as AnseriniRetriever.

XPM Config xpmir.rankers.standard.Model[source]

Bases: Config

Base class for standard IR models

XPM Config xpmir.rankers.standard.BM25(*, k1, b)[source]

Bases: Model

BM-25 model definition

k1: float = 0.9

b: float = 0.4

XPM Config xpmir.rankers.standard.QLDirichlet(*, mu)[source]

Bases: Model

Query likelihood (Dirichlet smoothing) model definition

mu: float = 1000
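For reference, the role of the k1, b and mu parameters can be sketched with textbook scoring functions. This is an illustration only: the functions below are hypothetical helpers, the IDF formula is one common variant, and the engine that instantiates the model (e.g. Anserini) may use slightly different formulas.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(query: List[str], doc: List[str], df: Dict[str, int],
               n_docs: int, avg_len: float, k1: float = 0.9, b: float = 0.4) -> float:
    """Textbook BM25 score of a tokenized document for a tokenized query."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # b controls length normalization, k1 controls term-frequency saturation
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
    return score


def ql_dirichlet_score(query: List[str], doc: List[str], collection_tf: Dict[str, int],
                       collection_len: int, mu: float = 1000.0) -> float:
    """Query log-likelihood with Dirichlet smoothing."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0:
            continue  # terms unseen in the collection are ignored in this sketch
        # mu controls how strongly the collection model smooths the document model
        score += math.log((tf[term] + mu * p_collection) / (len(doc) + mu))
    return score


corpus = [
    ["bm25", "ranking", "function"],
    ["dense", "retrieval", "ranking"],
    ["query", "likelihood", "model"],
]
df = Counter(term for doc in corpus for term in set(doc))
collection_tf = Counter(term for doc in corpus for term in doc)
collection_len = sum(len(doc) for doc in corpus)
avg_len = collection_len / len(corpus)

query = ["bm25", "ranking"]
bm25 = [bm25_score(query, doc, df, len(corpus), avg_len) for doc in corpus]
ql = [ql_dirichlet_score(query, doc, collection_tf, collection_len) for doc in corpus]
```

In both cases the first document, which contains both query terms, obtains the highest score.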

Multi-stage retrievers 

In a re-ranking setting, a two-stage retriever first retrieves candidates with a fast first-stage model, then re-scores them with a more expensive scorer.

The re-ranking process is memory-efficient: it uses lazy evaluation of first-stage results and maximises GPU throughput by batching query-document pairs across multiple queries.
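The idea behind this process can be sketched in plain Python, assuming hypothetical first_stage and score_batch callables (they stand in for the first-stage retriever and the scorer and are not part of the XPMIR API). Candidates are produced lazily by a generator and grouped into fixed-size batches that may span several queries.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# (query ID, query text, document ID, document text)
Candidate = Tuple[str, str, str, str]


def candidate_pairs(
    queries: Dict[str, str],
    first_stage: Callable[[str], Iterable[Tuple[str, str]]],
) -> Iterator[Candidate]:
    """Lazily yields first-stage candidates, one query after the other."""
    for qid, query in queries.items():
        for doc_id, text in first_stage(query):
            yield qid, query, doc_id, text


def batched(items: Iterable[Candidate], size: int) -> Iterator[List[Candidate]]:
    """Groups an iterator into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def rerank(queries, first_stage, score_batch, batch_size: int, top_k: int):
    results: Dict[str, List[Tuple[str, float]]] = {qid: [] for qid in queries}
    # A batch may mix pairs from several queries, which keeps batches full
    for batch in batched(candidate_pairs(queries, first_stage), batch_size):
        scores = score_batch([(query, text) for _, query, _, text in batch])
        for (qid, _, doc_id, _), score in zip(batch, scores):
            results[qid].append((doc_id, score))
    return {
        qid: sorted(docs, key=lambda pair: pair[1], reverse=True)[:top_k]
        for qid, docs in results.items()
    }


# Toy setup: the first stage returns the whole corpus, the scorer counts shared terms
corpus = {"d1": "sparse retrieval", "d2": "dense retrieval", "d3": "late interaction"}
batch_sizes: List[int] = []


def first_stage(query: str):
    return corpus.items()


def score_batch(pairs):
    batch_sizes.append(len(pairs))
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]


queries = {"q1": "dense retrieval", "q2": "late interaction"}
reranked = rerank(queries, first_stage, score_batch, batch_size=4, top_k=2)
```

With six candidate pairs and a batch size of four, the first batch contains all three pairs of q1 plus one pair of q2, so no batch is left half-empty at a query boundary.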

XPM Config xpmir.rankers.scorer.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)[source]

Bases: Retriever

Abstract class for all two stage retrievers (i.e. scorers and duo-scorers)

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever: The base retriever

scorer: xpmir.rankers.scorer.Scorer: The scorer used to re-rank the documents

top_k: int: The number of returned documents (if None, returns all the documents)

batchsize: int = 0: The batch size for the re-ranker

XPM Config xpmir.rankers.scorer.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)[source]

Bases: AbstractTwoStageRetriever

Uses a retriever to select the top-K documents, which are then re-ranked by a scorer.

Multi-GPU support: When set up with a lightning.Fabric instance, retrieve_all() shards the re-ranking task across GPUs and gathers the results. It uses efficient cross-query batching to maximize GPU throughput.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever: The base retriever

scorer: xpmir.rankers.scorer.Scorer: The scorer used to re-rank the documents

top_k: int: The number of returned documents (if None, returns all the documents)

batchsize: int = 0: The batch size for the re-ranker

Duo-retrievers 

Duo-retrievers predict which of two candidate documents is more relevant to the query (pairwise preference), rather than assigning an absolute score.

XPM Config xpmir.rankers.scorer.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)[source]

Bases: AbstractTwoStageRetriever

The two stage retriever for pairwise scorers.

Pairwise scorers compare two documents at a time, so the pairwise scores must be aggregated into a single score per document.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

retriever: xpmir.rankers.retriever.Retriever: The base retriever

scorer: xpmir.rankers.scorer.Scorer: The scorer used to re-rank the documents

top_k: int: The number of returned documents (if None, returns all the documents)

batchsize: int = 0: The batch size for the re-ranker
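One simple way to aggregate pairwise preferences, shown here as a sketch and not necessarily the strategy used by DuoTwoStageRetriever, is to sum for each document the preference scores it obtains against all other candidates. The prefer callable below is a hypothetical stand-in for a pairwise scorer returning the probability that the first document is more relevant than the second.

```python
from itertools import permutations
from typing import Callable, List, Tuple


def aggregate_pairwise(
    doc_ids: List[str], prefer: Callable[[str, str], float]
) -> List[Tuple[str, float]]:
    """Ranks documents by the sum of their preference scores over all ordered pairs."""
    totals = {doc_id: 0.0 for doc_id in doc_ids}
    for first, second in permutations(doc_ids, 2):
        totals[first] += prefer(first, second)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy pairwise scorer derived from hidden relevance values (assumption for the example)
relevance = {"d1": 0.2, "d2": 0.9, "d3": 0.5}


def prefer(first: str, second: str) -> float:
    return 1.0 if relevance[first] > relevance[second] else 0.0


ranking = aggregate_pairwise(["d1", "d2", "d3"], prefer)
```

Since every ordered pair is scored, the cost grows quadratically with the number of candidates, which is why such scorers are usually applied to a short candidate list.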

XPM Config xpmir.rankers.scorer.DuoLearnableScorer(*, doc, bibtex)[source]

Bases: AbstractModuleScorer

Base class for models that can score a triplet (query, document 1, document 2)

doc: str: Paper description or title (used in HF Hub README)

bibtex: str: BibTeX citation (used in HF Hub README)

Miscellaneous retrievers 

Utility retrievers for loading pre-computed runs, hydrating results with document text, or exhaustive scoring.

XPM Config xpmir.rankers.full.FullRetriever(*, store, documents)[source]

Bases: Retriever

Retrieves all the documents of the collection

This can be used to build a small validation set on a subset of the collection - in that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

documents: datamaestro_ir.data.Documents

XPM Config xpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher)[source]

Bases: Retriever

Scores all the documents from a collection

Encodes all queries at once, then processes documents in batches, scoring the full query×document matrix each batch. This is more efficient than the TwoStageRetriever approach for small collections.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

documents: datamaestro_ir.data.Documents: The set of documents to consider

scorer: xpmir.neural.DualRepresentationScorer: The scorer (a dual representation scorer)

batchsize: int = 0

batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher()
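The batching strategy described for FullRetrieverRescorer can be sketched as follows, assuming that queries and documents are already encoded as dense vectors and scored with an inner product. The score_all function and the toy vectors are hypothetical; the actual scorer and batching logic may differ.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def score_all(
    query_vectors: Dict[str, Vector],
    doc_vectors: List[Tuple[str, Vector]],
    batch_size: int,
    top_k: int,
) -> Dict[str, List[Tuple[float, str]]]:
    """Scores every (query, document) pair, one batch of documents at a time."""
    heaps: Dict[str, List[Tuple[float, str]]] = {qid: [] for qid in query_vectors}
    for start in range(0, len(doc_vectors), batch_size):
        batch = doc_vectors[start:start + batch_size]
        # Full query x document score matrix for this batch
        for qid, query in query_vectors.items():
            for doc_id, doc in batch:
                item = (dot(query, doc), doc_id)
                if len(heaps[qid]) < top_k:
                    heapq.heappush(heaps[qid], item)
                else:
                    # Keep only the top_k best documents seen so far
                    heapq.heappushpop(heaps[qid], item)
    return {qid: sorted(heap, reverse=True) for qid, heap in heaps.items()}


query_vectors = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
doc_vectors = [("d1", [0.9, 0.1]), ("d2", [0.2, 0.8]), ("d3", [0.5, 0.5])]
top = score_all(query_vectors, doc_vectors, batch_size=2, top_k=2)
```

The per-query heaps bound memory usage to top_k entries per query, regardless of the collection size.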

XPM Config xpmir.rankers.retriever.RetrieverHydrator(*, store, retriever)[source]

Bases: Retriever

Hydrate retrieved results with document text

store: datamaestro_ir.data.DocumentStore: The store for document texts

retriever: xpmir.rankers.retriever.Retriever: The retriever to hydrate

XPM Config xpmir.rankers.retriever.RunRetriever(*, store, run, documents)[source]

Bases: Retriever

A retriever that returns documents from a pre-computed run. It can be used to build a two-stage retriever with a pre-computed first stage (e.g. for validation when training a scorer model)

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

run: datamaestro_ir.data.AdhocRun: The pre-computed run

documents: datamaestro_ir.data.Documents: Associated documents

Distributed Retrieval 

XPMIR supports distributed retrieval and re-ranking across multiple GPUs using Lightning Fabric. Currently, this optimized distributed logic is implemented for xpmir.index.sparse.SparseRetriever and xpmir.rankers.scorer.TwoStageRetriever.

This is particularly useful for large-scale evaluation on thousands of queries.

How it works 

When a retriever is configured with a Fabric instance, the retrieve_all() method leverages all available devices:

Query Sharding: The set of queries is automatically partitioned across the available GPUs.
Parallel Processing: Each device processes its assigned shard.
- For Sparse Retrieval (SparseRetriever), queries are encoded in batches and searched via asynchronous workers.
- For Two-Stage Retrieval (TwoStageRetriever), document re-ranking is batched across queries for maximum throughput.
Result Gathering: Once processing is complete, results are collected from all ranks and merged on the global zero rank.
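The three steps above can be simulated in a single process. The sketch below does not use Lightning Fabric: shard_queries, process_shard, merge_shards and the round-robin assignment are assumptions made for illustration, and the loop over ranks stands in for processes running in parallel.

```python
from typing import Dict, List


def shard_queries(queries: Dict[str, str], rank: int, world_size: int) -> Dict[str, str]:
    """Round-robin assignment of queries to a rank (deterministic order)."""
    ordered = sorted(queries.items())
    return {qid: text for i, (qid, text) in enumerate(ordered) if i % world_size == rank}


def process_shard(shard: Dict[str, str]) -> Dict[str, List[str]]:
    """Placeholder for the per-device retrieval step: returns fake document IDs."""
    return {qid: [f"doc-for-{qid}"] for qid in shard}


def merge_shards(shards: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merges the per-rank results into a single dictionary."""
    merged: Dict[str, List[str]] = {}
    for shard in shards:
        merged.update(shard)
    return merged


queries = {f"q{i}": f"query text {i}" for i in range(5)}
world_size = 2

# Each iteration plays the role of one rank
per_rank = [
    process_shard(shard_queries(queries, rank, world_size)) for rank in range(world_size)
]
merged = merge_shards(per_rank)
```

Because the shards are disjoint, merging is a simple dictionary union and every query appears exactly once in the final result.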

Usage 

Distributed retrieval is automatically enabled when using the Evaluate task with a multi-GPU FabricConfiguration.

To use it manually in a script:

from lightning import Fabric
fabric = Fabric(devices=2, strategy="ddp")
fabric.launch()

retriever.initialize()
retriever.setup_with_fabric(fabric)

# Distributed retrieval
results = retriever.retrieve_all(queries)

# Results are gathered on rank 0
if fabric.is_global_zero:
    print(f"Total queries retrieved: {len(results)}")

Index backends 

The sections below describe the available index backends and their associated retrievers.

Anserini 

Anserini provides classical inverted-index retrieval (BM25, query-likelihood, etc.) via Lucene.

XPM Config xpmir.index.anserini.Index(*, id, count, file_access, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)[source]

Bases: AdhocIndex

Anserini-backed index

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path: Path to the index

storePositions: bool = False: Store term positions

storeDocvectors: bool = False: Store document term vectors

storeRaw: bool = False: Store raw document

storeContents: bool = False: Store processed documents (e.g. without HTML tags)

stemmer: str = porter: The stemmer to use

XPM Config xpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)[source]

Bases: Retriever

An Anserini-based retriever

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

index: xpmir.index.anserini.Index: The Anserini index

model: xpmir.rankers.standard.Model: the model used to search. Only supports BM25 so far.

k: int = 1500: Number of results to retrieve

XPM Task xpmir.interfaces.anserini.IndexCollection(*, id, count, file_access, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, documents, threads)[source]

Bases: Index, Task

An Anserini (https://github.com/castorini/anserini) index

id: str: Use an empty ID since identifier is determined by documents

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path (generated)

storePositions: bool = False: Store term positions

storeDocvectors: bool = False: Store document term vectors

storeRaw: bool = False: Store raw document

storeContents: bool = False: Store processed documents (e.g. without HTML tags)

stemmer: str = porter: The stemmer to use

documents: datamaestro_ir.data.Documents: The documents to index

threads: int = 8: Number of threads when indexing

XPM Task xpmir.interfaces.anserini.SearchCollection(*, index, topics, model)[source]

Bases: Task

index: xpmir.index.anserini.Index

topics: datamaestro_ir.data.Topics

model: xpmir.rankers.standard.Model

path: path (generated)

FAISS 

FAISS provides approximate nearest-neighbour search for dense vector retrieval.

XPM Config xpmir.index.faiss.FaissIndex(*, normalize, documents)[source]

Bases: Config

FAISS Index

normalize: bool: Whether vectors should be normalized (L2)

faiss_index: path (generated): Path to the file containing the index

documents: datamaestro_ir.data.DocumentStore: The set of documents

XPM Task xpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, batcher, hooks, indexspec, sampler)[source]

Bases: FaissIndex, Task

Constructs a FAISS index backed up by an index

During execution, InitializationHooks are used (pre/post)

normalize: bool: Whether vectors should be normalized (L2)

faiss_index: path (generated): Path to the file containing the index

documents: datamaestro_ir.data.DocumentStore: The set of documents

encoder: xpmir.text.encoders.TextEncoder: Encoder for document texts

batchsize: int = 1: The batch size used when computing representations of documents

batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher(): The way to prepare batches of documents

hooks: List[xpmir.context.Hook] = []: An optional list of hooks

indexspec: str: The index type as a factory string See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices and https://github.com/facebookresearch/faiss/wiki/The-index-factory for the combination of the index factory

sampler: xpmir.documents.samplers.DocumentSampler: Optional document sampler when training the index – by default, all the documents from the collection are used

XPM Config xpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)[source]

Bases: Retriever

Retriever based on Faiss

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

encoder: xpmir.text.encoders.TextEncoder: The query encoder

index: xpmir.index.faiss.FaissIndex: The faiss index

topk: int: the number of documents to be retrieved

fast-plaid (ColBERT / PLAID)

Interface to fast-plaid, a Rust-based implementation of PLAID / ColBERT late-interaction retrieval. Per-document token vectors can be reconstructed from the compressed index via get_document_tokens().
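For readers unfamiliar with late interaction, the ColBERT relevance score sums, over query tokens, the maximum similarity with any document token (MaxSim). The following is a minimal sketch with toy two-dimensional token vectors; the maxsim helper is hypothetical and unrelated to the fast-plaid API.

```python
from typing import List, Sequence

TokenVectors = List[Sequence[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens: TokenVectors, doc_tokens: TokenVectors) -> float:
    """Sum over query tokens of the best inner product with a document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Here doc_b matches both query tokens exactly and therefore scores higher than doc_a. Scoring requires the per-token vectors of each document, which is why the index stores them in compressed form.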

XPM Config xpmir.index.plaid.PlaidIndex(*, documents, compress_only, index_path, device, in_memory)[source]

Bases: Config

A ColBERT / PLAID index backed by fast-plaid.

The index stores per-token document embeddings in fast-plaid’s compressed centroid + residual format. Per-document token vectors can be reconstructed (approximately) via get_document_tokens(), which delegates to fast-plaid’s get_embeddings method. The reconstruction quality is controlled by n_bits.

When compress_only is True the index only contains the compressed vectors (centroids + quantised residuals) without the IVF search structure. This is cheaper to build and sufficient when only get_document_tokens() is needed. Attempting to search a compress-only index via PlaidRetriever will raise an error.

documents: datamaestro_ir.data.DocumentStore: Set of documents to index.

compress_only: bool = False

index_path: path: Directory containing the fast-plaid index and side-car files.

device: str: Device used to load the index for get_document_tokens() ("" = auto). Fixed at first use because the underlying FastPlaid instance is cached.

in_memory: bool = False: If True, load the index fully into device memory (passes low_memory=False to fast-plaid). Use when the index fits in VRAM/RAM and you want faster decompression/search; otherwise the document codes and residuals stay memory-mapped from disk.

get_document_tokens(docids: list[int | str], device: str = '') → Tensor[source]

Return the (approximate) per-token embeddings for a document.

The vectors are reconstructed from fast-plaid’s compressed centroid + residual storage using FastPlaid.get_embeddings. The reconstruction quality depends on n_bits.

Parameters:

docids – The document identifiers. Integers are interpreted as internal positions in the index (0..num_docs-1); strings are looked up in the external-to-internal map written at indexing time.
device – Device for the fast-plaid instance used to decompress ("" = auto).

Returns:

A (num_tokens, dim) float tensor containing the reconstructed token embeddings.
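To give an intuition of why the reconstruction is approximate and depends on n_bits, the sketch below stores a vector as the index of its nearest centroid plus a residual quantised uniformly on n_bits per dimension. This is a simplified illustration under stated assumptions: the compress and decompress helpers, the uniform quantiser and the max_abs clipping range are hypothetical, and fast-plaid's actual scheme may differ.

```python
from typing import List, Sequence, Tuple


def compress(vector: Sequence[float], centroids: List[Sequence[float]],
             n_bits: int, max_abs: float) -> Tuple[int, List[int]]:
    """Returns the nearest centroid index and one quantised code per dimension."""
    cid = min(
        range(len(centroids)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(vector, centroids[i])),
    )
    levels = 2 ** n_bits
    step = 2 * max_abs / levels
    codes = []
    for x, c in zip(vector, centroids[cid]):
        bucket = int((x - c + max_abs) // step)
        codes.append(min(levels - 1, max(0, bucket)))  # clip to valid codes
    return cid, codes


def decompress(cid: int, codes: List[int], centroids: List[Sequence[float]],
               n_bits: int, max_abs: float) -> List[float]:
    step = 2 * max_abs / 2 ** n_bits
    # Each code is decoded as the centre of its bucket
    return [c - max_abs + (code + 0.5) * step for c, code in zip(centroids[cid], codes)]


def max_error(vector, centroids, n_bits: int, max_abs: float = 0.5) -> float:
    """Largest per-dimension reconstruction error after a round trip."""
    cid, codes = compress(vector, centroids, n_bits, max_abs)
    restored = decompress(cid, codes, centroids, n_bits, max_abs)
    return max(abs(x - y) for x, y in zip(vector, restored))


centroids = [[0.0, 0.0], [1.0, 1.0]]
token = [0.9, 1.2]
error_2_bits = max_error(token, centroids, n_bits=2)
error_4_bits = max_error(token, centroids, n_bits=4)
```

In this toy setting, increasing n_bits from 2 to 4 reduces the reconstruction error, at the cost of a larger index.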

XPM Task xpmir.index.plaid.PlaidIndexBuilder(*, documents, encoder, batch_size, buffer_size, fast_plaid_batch_size, n_bits, kmeans_niters, n_samples_kmeans, max_points_per_centroid, seed, compress_only, low_memory, force_cpu_indexing)[source]

Bases: Task

Submit type: xpmir.index.plaid.PlaidIndex

Builds a fast-plaid index from a document collection.

The builder encodes every document using the given ColBERTEncoder, collects the valid (i.e. non-padding) token vectors, and feeds them to fast-plaid.

The fast-plaid index stores the embeddings in a compressed centroid + residual format, so no separate raw-token file is needed. Per-document token vectors can be reconstructed later via PlaidIndex.get_document_tokens().

documents: datamaestro_ir.data.DocumentStore: Set of documents to index.

encoder: xpmir.text.encoders.TextEncoderBase: The ColBERT-style encoder used to produce per-token embeddings.

batch_size: int = 32: Encoder batch size. Warning, different from the batch size used internally by fast-plaid (‘fast_plaid_batch_size’)

buffer_size: int = 1000: Number of documents to encode and accumulate in RAM before creating/updating the fast-plaid index and fitting the centroids. The token embeddings used to initialize the centroids will be sampled randomly from those documents by plaid (or they will all be used if n_samples_kmeans is 0).

fast_plaid_batch_size: int = 32: Fast plaid internal batch size.

n_bits: int = 2: Number of bits used by fast-plaid for residual quantisation.

kmeans_niters: int = 4: Number of K-means iterations performed by fast-plaid when clustering the centroids.

n_samples_kmeans: int = 0: Number of token samples used to train the centroids (0 = fast-plaid default).

max_points_per_centroid: int = 256: Maximum number of points (documents) per centroid. Controls the creation of new centroids.

seed: int = 42: Random seed for reproducibility (passed to fast-plaid’s index creation).

compress_only: bool = False: When True, skip IVF construction. The resulting index supports PlaidIndex.get_document_tokens() but not search via PlaidRetriever. Requires fast-plaid support for compress_only (see lightonai/fast-plaid#41). Falls back to building the full index with a warning if unsupported.

low_memory: bool = True: https://github.com/lightonai/fast-plaid#-search-speed-tip-low_memoryfalse If index fits on VRAM, set to False for faster search. Otherwise, keep True to avoid OOM errors.

force_cpu_indexing: bool = False: When True, forces the use of CPU for indexing even if a GPU is available. This can be useful to avoid GPU OOM errors during indexing, especially for large corpora.

fabric_config: xpm_torch.configuration.FabricConfiguration (generated): Control the device for the model encoding and fast-plaid index.

index_path: path (generated): Output directory for the index and its side-car files.

XPM Config xpmir.index.plaid.PlaidRetriever(*, store, encoder, index, topk, n_ivf_probe, n_full_scores)[source]

Bases: Retriever

Retriever using a fast-plaid PLAID index.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

encoder: xpmir.rankers.scorer.AbstractModuleScorer: The query encoder. Typically the same encoder that was used to build the index.

index: xpmir.index.plaid.PlaidIndex: The fast-plaid index to search.

topk: int: Number of documents to return per query.

n_ivf_probe: int = 8: Number of inverted-list clusters explored by fast-plaid at search time.

n_full_scores: int = 0: Number of candidates for which fast-plaid computes full scores (0 = fast-plaid default).

fabric_config: xpm_torch.configuration.FabricConfiguration (generated): Control the device for the model encoding and fast-plaid index.

Sparse retrieval 

Learned sparse retrieval indexes (e.g. for SPLADE), backed by the impact-index Rust library.

XPM Config xpmir.index.sparse.AbstractSparseRetrieverIndex(*, documents)

Bases: Config, ABC

documents: datamaestro_ir.data.DocumentStore: The indexed document collection

XPM Task xpmir.index.sparse.AbstractSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs)

Bases: Task, ABC, Generic[InputType]

Builds an index from a sparse representation

Assumes that documents and queries have the same dimension, and that the score is computed through an inner product

documents: datamaestro_ir.data.DocumentStore: Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase: The encoder

batcher: xpm_torch.batchers.Batcher (generated): Batcher used when computing representations

batch_size: int: Size of batches

ordered_index: bool: Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant): Version 3 of the index

max_docs: int = 0: Maximum number of indexed documents
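The inner-product assumption can be illustrated with a toy inverted index. The sketch below scores documents exhaustively, term by term; it does not implement WAND or any other dynamic pruning strategy, and build_postings and search are hypothetical helpers rather than part of the impact-index API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]  # term -> weight


def build_postings(doc_vectors: Dict[str, SparseVector]) -> Dict[str, List[Tuple[str, float]]]:
    """Maps each term to the list of (document ID, weight) pairs containing it."""
    postings: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            postings[term].append((doc_id, weight))
    return postings


def search(postings, query: SparseVector, top_k: int) -> List[Tuple[str, float]]:
    """Accumulates the inner product between the query and every matching document."""
    scores: Dict[str, float] = defaultdict(float)
    for term, query_weight in query.items():
        for doc_id, doc_weight in postings.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


doc_vectors = {
    "d1": {"neural": 1.0, "ranking": 0.5},
    "d2": {"ranking": 2.0},
}
postings = build_postings(doc_vectors)
hits = search(postings, {"ranking": 1.0, "neural": 2.0}, top_k=2)
```

Only the terms shared by the query and a document contribute to its score, which is what makes inverted-index traversal efficient for sparse vectors.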

XPM Config xpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, batchsize, in_memory)

Bases: Retriever, Generic[InputType]

Retriever for learned sparse models (e.g. SPLADE).

This retriever uses a TextEncoderBase to encode queries into sparse vectors, which are then used to search an AbstractSparseRetrieverIndex.

Multi-GPU support: When set up with a lightning.Fabric instance, retrieve_all() automatically shards the queries across GPUs and merges the results. It also adjusts the number of asynchronous search workers to prevent CPU oversubscription.

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

index: xpmir.index.sparse.AbstractSparseRetrieverIndex: The sparse retriever index

encoder: xpmir.text.encoders.TextEncoderBase: Encodes InputType records to text representation output

topk: int: Number of documents to return

batcher: xpm_torch.batchers.Batcher (generated): The way to prepare batches of queries (when using retrieve_all)

batchsize: int: Size of batches (when using retrieve_all)

in_memory: bool = False: Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

Impact library (Rust)

XPM Config xpmir.index.sparse.SparseRetrieverIndex(*, documents, index_path)

Bases: AbstractSparseRetrieverIndex

documents: datamaestro_ir.data.DocumentStore: The indexed document collection

index_path: path

XPM Task xpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings)

Bases: AbstractSparseRetrieverIndexBuilder[InputType]

Submit type: Any

documents: datamaestro_ir.data.DocumentStore: Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase: The encoder

batcher: xpm_torch.batchers.Batcher (generated): Batcher used when computing representations

batch_size: int: Size of batches

ordered_index: bool: Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant): Version 3 of the index

max_docs: int = 0: Maximum number of indexed documents

in_memory: bool = False: Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

index_path: path (generated)

checkpoint_frequency: int = 0: Checkpoint frequency (allows recovery at the cost of writing some information to disk)

max_postings: int: Number of postings accumulated for a term before they are dumped to disk

fabric_config: xpm_torch.configuration.FabricConfiguration (generated): Runtime configuration, managed by Fabric

Block-Max Pruning

Adapters for Faster Learned Sparse Retrieval with Block-Max Pruning.

XPM Config xpmir.index.sparse.BMPSparseRetrieverIndex(*, documents, index_path)

Bases: AbstractSparseRetrieverIndex

documents: datamaestro_ir.data.DocumentStore: The indexed document collection

index_path: path: The path of the BMP index

XPM Task xpmir.index.sparse.BMPSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings, block_size, compress_range)

Bases: SparseRetrieverIndexBuilder[InputType]

Submit type: Any

Index using a BMP index

documents: datamaestro_ir.data.DocumentStore: Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase: The encoder

batcher: xpm_torch.batchers.Batcher (generated): Batcher used when computing representations

batch_size: int: Size of batches

ordered_index: bool: Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies

version: int = 3 (constant): Version 3 of the index

max_docs: int = 0: Maximum number of indexed documents

in_memory: bool = False: Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

index_path: path (generated)

checkpoint_frequency: int = 0: Checkpoint frequency (allows recovery at the cost of writing some information to disk)

max_postings: int: Number of postings accumulated for a term before they are dumped to disk

fabric_config: xpm_torch.configuration.FabricConfiguration (generated): Runtime configuration, managed by Fabric

block_size: int: The block size

compress_range: bool: Compress the BM index

bmp_index_path: path (generated): The final index path

XPM Config xpmir.index.sparse.BMPSparseRetriever(*, store, index, encoder, topk, batchsize, in_memory, alpha, beta)

Bases: SparseRetriever

A Block-Max Pruning retriever

store: datamaestro_ir.data.DocumentStore: Give the document store associated with this retriever

index: xpmir.index.sparse.AbstractSparseRetrieverIndex: The sparse retriever index

encoder: xpmir.text.encoders.TextEncoderBase: Encodes InputType records to text representation output

topk: int: Number of documents to return

batcher: xpm_torch.batchers.Batcher (generated): The way to prepare batches of queries (when using retrieve_all)

batchsize: int: Size of batches (when using retrieve_all)

in_memory: bool = False: Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

alpha: float: Granularity of approximation (0 to 1, 1 = no approximation)

beta: float: Percentage of query tokens to keep (0 to 1, 1 = no pruning)