Retrieval
This page describes the components used for retrieval, i.e. finding the top \(k\) documents from a collection given a query. XPMIR supports classical models (BM25, query-likelihood), dense retrieval (FAISS), sparse learned retrieval, and late-interaction models (ColBERT/PLAID), as well as multi-stage pipelines that combine them.
Base classes
Core data structures and the abstract retriever interface that all implementations extend.
- class xpmir.rankers.ScoredDocument(document: dict, score: float)
Bases: object
A data structure that associates a score with a document, so that documents can be sorted by score (e.g., for nDCG).
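As an illustration of sorting by score, here is a minimal sketch. `ScoredDoc` and `sort_by_score` are hypothetical stand-ins written for this example, not the library's own class or helpers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDoc:
    """Hypothetical stand-in for ScoredDocument: a document plus its score."""

    document: dict
    score: float


def sort_by_score(results: List[ScoredDoc]) -> List[ScoredDoc]:
    """Return the results sorted by decreasing score (ties keep input order)."""
    return sorted(results, key=lambda sd: sd.score, reverse=True)


results = [
    ScoredDoc({"id": "d1"}, 0.2),
    ScoredDoc({"id": "d2"}, 1.5),
    ScoredDoc({"id": "d3"}, 0.9),
]
ranked_ids = [sd.document["id"] for sd in sort_by_score(results)]
```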
- XPM Config xpmir.rankers.retriever.Retriever(*, store)
Bases: Config, ModuleContainer, ABC
A retriever is a model that returns the top-scored documents for a given query.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- abstractmethod retrieve(record: IDTextRecord) → List[ScoredDocument]
Retrieves documents for a query, returning a list sorted by decreasing score.
If content is true, the full text of each document is included.
- retrieve_all(queries: Dict[str, IDTextRecord]) → Dict[str, List[ScoredDocument]]
Retrieves documents for a set of queries.
By default, this iterates over the queries and calls self.retrieve for each one; subclasses can override it with an optimized implementation.
- Parameters:
queries – A dictionary where the key is the ID of the query, and the value is the query record (an IDTextRecord)
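The default behaviour of retrieve_all() can be sketched as follows. This is a simplified illustration, not the library's implementation: `SketchRetriever` and `TermOverlapRetriever` are hypothetical classes, queries are plain strings rather than IDTextRecord objects, and results are (document ID, score) pairs rather than ScoredDocument instances.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Hypothetical simplification: a result is a (doc_id, score) pair and a
# query is a plain string, instead of ScoredDocument / IDTextRecord.
Result = Tuple[str, float]


class SketchRetriever(ABC):
    @abstractmethod
    def retrieve(self, query: str) -> List[Result]:
        """Return results sorted by decreasing score."""

    def retrieve_all(self, queries: Dict[str, str]) -> Dict[str, List[Result]]:
        # Default behaviour: one retrieve() call per query. Subclasses can
        # override this method to process queries in batches.
        return {qid: self.retrieve(text) for qid, text in queries.items()}


class TermOverlapRetriever(SketchRetriever):
    """Toy retriever scoring documents by the number of shared terms."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs

    def retrieve(self, query: str) -> List[Result]:
        q_terms = set(query.lower().split())
        scored = [
            (doc_id, float(len(q_terms & set(text.lower().split()))))
            for doc_id, text in self.docs.items()
        ]
        return sorted(scored, key=lambda r: r[1], reverse=True)


retriever = TermOverlapRetriever({"d1": "neural ranking models", "d2": "sparse index"})
runs = retriever.retrieve_all({"q1": "neural ranking", "q2": "sparse retrieval index"})
```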
Standard IR models
Definitions for classical probabilistic retrieval models. These are
backend-agnostic specifications that can be instantiated with a concrete
engine such as AnseriniRetriever.
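For reference, a textbook BM25 scorer can be sketched in a few lines. This is only an illustration of the model family: the function name, the tokenisation, the `k1` and `b` values and the exact IDF variant are assumptions made for this example, and a concrete engine may use a different variant of the formula.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(
    query: List[str], docs: Dict[str, List[str]], k1: float = 0.9, b: float = 0.4
) -> Dict[str, float]:
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Length normalisation component of the denominator
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores


docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "dogs chase the cat".split(),
    "d3": "stock markets fell".split(),
}
scores = bm25_scores("cat mat".split(), docs)
```

Here d1 matches both query terms, d2 matches one, and d3 matches none, so the scores decrease in that order.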
Multi-stage retrievers
In a re-ranking setting, a two-stage retriever first retrieves candidates with a fast first-stage model, then re-scores them with a more expensive scorer.
The re-ranking process is memory-efficient: it uses lazy evaluation of first-stage results and maximises GPU throughput by batching query-document pairs across multiple queries.
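The idea of lazy first-stage evaluation combined with cross-query batching can be sketched as below. This is a simplified model under stated assumptions, not the library's code: `candidate_pairs`, `rerank` and `toy_scorer` are hypothetical names, and documents are represented by their IDs only.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List, Tuple

Pair = Tuple[str, str]  # (query id, document id)


def candidate_pairs(
    first_stage: Callable[[str], List[str]], query_ids: List[str], top_k: int
) -> Iterator[Pair]:
    """Lazily yield (query, document) pairs; nothing is retrieved up front."""
    for qid in query_ids:
        for doc_id in first_stage(qid)[:top_k]:
            yield qid, doc_id


def rerank(
    first_stage: Callable[[str], List[str]],
    score_batch: Callable[[List[Pair]], List[float]],
    query_ids: List[str],
    top_k: int,
    batchsize: int,
) -> Dict[str, List[Tuple[str, float]]]:
    pairs = candidate_pairs(first_stage, query_ids, top_k)
    results: Dict[str, List[Tuple[str, float]]] = {qid: [] for qid in query_ids}
    # A batch may span several queries, so the scorer receives full batches
    # (except possibly the last one)
    while batch := list(islice(pairs, batchsize)):
        for (qid, doc_id), score in zip(batch, score_batch(batch)):
            results[qid].append((doc_id, score))
    for ranked in results.values():
        ranked.sort(key=lambda item: item[1], reverse=True)
    return results


candidates = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d4"]}
batch_sizes = []


def toy_scorer(batch: List[Pair]) -> List[float]:
    batch_sizes.append(len(batch))
    # Hypothetical scorer: the score is the numeric part of the document id
    return [float(doc_id[1:]) for _, doc_id in batch]


reranked = rerank(candidates.__getitem__, toy_scorer, ["q1", "q2"], top_k=3, batchsize=4)
```

With five candidate pairs and a batch size of four, the scorer is called twice, and the first batch mixes pairs from both queries.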
- XPM Config xpmir.rankers.scorer.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)
Bases: Retriever
Abstract class for all two-stage retrievers (i.e. scorers and duo-scorers)
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.retriever.Retriever
The base retriever
- scorer: xpmir.rankers.scorer.Scorer
The scorer used to re-rank the documents
- XPM Config xpmir.rankers.scorer.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)
Bases: AbstractTwoStageRetriever
Uses one retriever to select the top-K documents, which are then re-ranked by a scorer.
- Multi-GPU support:
When set up with a lightning.Fabric instance, retrieve_all() shards the re-ranking task across GPUs and gathers the results. It batches query-document pairs across queries to maximize GPU throughput.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.retriever.Retriever
The base retriever
- scorer: xpmir.rankers.scorer.Scorer
The scorer used to re-rank the documents
Duo-retrievers
Duo-retrievers predict which of two candidate documents is more relevant to the query (pairwise preference), rather than assigning an absolute score.
- XPM Config xpmir.rankers.scorer.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize)
Bases: AbstractTwoStageRetriever
The two-stage retriever for pairwise scorers.
With a pairwise scorer, the pairwise scores need to be aggregated in some way to produce a ranking.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.retriever.Retriever
The base retriever
- scorer: xpmir.rankers.scorer.Scorer
The scorer used to re-rank the documents
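One possible way to aggregate pairwise scores is to sum, for each document, its preference scores against every other candidate. The sketch below shows this strategy only as an example; the source does not state which aggregation DuoTwoStageRetriever uses, and `aggregate_pairwise` and `toy_prefer` are hypothetical names.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple


def aggregate_pairwise(
    doc_ids: List[str], prefer: Callable[[str, str], float]
) -> List[Tuple[str, float]]:
    """Rank documents by the sum of their pairwise preference scores.

    prefer(a, b) is assumed to return the probability that a is more
    relevant than b for the current query.
    """
    totals: Dict[str, float] = {doc_id: 0.0 for doc_id in doc_ids}
    for a, b in permutations(doc_ids, 2):
        totals[a] += prefer(a, b)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical preference model derived from hidden relevance values
relevance = {"d1": 0.1, "d2": 0.9, "d3": 0.5}


def toy_prefer(a: str, b: str) -> float:
    return 1.0 if relevance[a] > relevance[b] else 0.0


ranking = aggregate_pairwise(list(relevance), toy_prefer)
```

Note that this requires scoring every ordered pair, so the cost grows quadratically with the number of candidates.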
Miscellaneous retrievers
Utility retrievers for loading pre-computed runs, hydrating results with document text, or exhaustive scoring.
- XPM Config xpmir.rankers.full.FullRetriever(*, store, documents)
Bases: Retriever
Retrieves all the documents of the collection.
This can be used to build a small validation set on a subset of the collection. In that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_ir.data.Documents
- XPM Config xpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher)
Bases: Retriever
Scores all the documents from a collection.
Encodes all queries at once, then processes documents in batches, scoring the full query × document matrix for each batch. This is more efficient than the TwoStageRetriever approach for small collections.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_ir.data.Documents
The set of documents to consider
- scorer: xpmir.neural.DualRepresentationScorer
The scorer (a dual representation scorer)
- batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher()
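The exhaustive scoring strategy described for FullRetrieverRescorer can be sketched as follows. This is a pure-Python illustration under stated assumptions: vectors are plain lists, the score is an inner product, and `score_all` is a hypothetical function; a real dual-representation scorer would compute each batch as a single matrix product on the GPU.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def score_all(
    queries: Dict[str, Vector],
    documents: List[Tuple[str, Vector]],
    batchsize: int,
    top_k: int,
) -> Dict[str, List[Tuple[float, str]]]:
    """Exhaustively score every query against every document, batch by batch."""
    # Queries are encoded once (here they are already vectors); one bounded
    # min-heap per query keeps only the current top_k results.
    heaps: Dict[str, List[Tuple[float, str]]] = {qid: [] for qid in queries}
    for start in range(0, len(documents), batchsize):
        batch = documents[start : start + batchsize]
        # Full (queries x batch) score matrix for this batch of documents
        for qid, q_vec in queries.items():
            for doc_id, d_vec in batch:
                item = (dot(q_vec, d_vec), doc_id)
                if len(heaps[qid]) < top_k:
                    heapq.heappush(heaps[qid], item)
                else:
                    heapq.heappushpop(heaps[qid], item)
    return {qid: sorted(heap, reverse=True) for qid, heap in heaps.items()}


queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
documents = [("d1", [0.9, 0.1]), ("d2", [0.2, 0.8]), ("d3", [0.5, 0.5])]
top = score_all(queries, documents, batchsize=2, top_k=2)
```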
- XPM Config xpmir.rankers.retriever.RetrieverHydrator(*, store, retriever)
Bases: Retriever
Hydrates retrieved results with document text.
- store: datamaestro_ir.data.DocumentStore
The store for document texts
- retriever: xpmir.rankers.retriever.Retriever
The retriever to hydrate
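Conceptually, hydration replaces bare document identifiers with records that also carry the text, looked up in the store. The sketch below assumes a dictionary as the store and (document ID, score) pairs as results; `hydrate` is a hypothetical function, not the library's API.

```python
from typing import Dict, List, Tuple


def hydrate(
    results: List[Tuple[str, float]], store: Dict[str, str]
) -> List[Tuple[dict, float]]:
    """Replace bare document ids with records that also carry the text."""
    return [({"id": doc_id, "text": store[doc_id]}, score) for doc_id, score in results]


store = {"d1": "first text", "d2": "second text"}
hydrated = hydrate([("d2", 1.5), ("d1", 0.3)], store)
```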
- XPM Config xpmir.rankers.retriever.RunRetriever(*, store, run, documents)
Bases: Retriever
A retriever that returns documents from a pre-computed run. This can be useful to build a two-stage retriever with a pre-computed first stage (e.g. for validation when training a scorer model).
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- run: datamaestro_ir.data.AdhocRun
The pre-computed run
- documents: datamaestro_ir.data.Documents
Associated documents
Distributed Retrieval
XPMIR supports distributed retrieval and re-ranking across multiple GPUs using Lightning Fabric. Currently, this optimized distributed logic is implemented for SparseRetriever (xpmir.index.sparse) and TwoStageRetriever (xpmir.rankers.scorer).
This is particularly useful for large-scale evaluation on thousands of queries.
How it works
When a retriever is configured with a Fabric instance, the retrieve_all()
method leverages all available devices:
- Query Sharding: The set of queries is automatically partitioned across the available GPUs.
- Parallel Processing: Each device processes its assigned shard.
  - For Sparse Retrieval (SparseRetriever), queries are encoded in batches and searched via asynchronous workers.
  - For Two-Stage Retrieval (TwoStageRetriever), document re-ranking is batched across queries for maximum throughput.
- Result Gathering: Once processing is complete, results are collected from all ranks and merged on the global zero rank.
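The three steps above can be simulated in a single process as follows. This is only a sketch: the round-robin partitioning is an assumption (the source says only that queries are partitioned), and `shard_queries`, `gather_on_zero` and `simulate` are hypothetical helpers that do not use Fabric.

```python
from typing import Callable, Dict, List

Run = Dict[str, List[str]]  # query id -> ranked document ids


def shard_queries(queries: Dict[str, str], rank: int, world_size: int) -> Dict[str, str]:
    """Round-robin partition: rank r keeps every world_size-th query."""
    return {
        qid: text
        for position, (qid, text) in enumerate(queries.items())
        if position % world_size == rank
    }


def gather_on_zero(per_rank: List[Run]) -> Run:
    """Merge the per-rank results, as the global zero rank would."""
    merged: Run = {}
    for partial in per_rank:
        merged.update(partial)
    return merged


def simulate(
    queries: Dict[str, str], world_size: int, retrieve: Callable[[str], List[str]]
) -> Run:
    # Each list element plays the role of one device; with Fabric these
    # shards would be processed in parallel processes.
    per_rank = [
        {qid: retrieve(text) for qid, text in shard_queries(queries, rank, world_size).items()}
        for rank in range(world_size)
    ]
    return gather_on_zero(per_rank)


queries = {f"q{i}": f"query text {i}" for i in range(5)}
merged = simulate(queries, world_size=2, retrieve=lambda text: [text.upper()])
```

Because the shards are disjoint, merging is a plain dictionary update and every query appears exactly once in the gathered result.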
Usage
Distributed retrieval is automatically enabled when using the Evaluate
task with a multi-GPU FabricConfiguration.
To use it manually in a script:
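from lightning import Fabric

# Launch one process per device
fabric = Fabric(devices=2, strategy="ddp")
fabric.launch()

# `retriever` is an already configured retriever and `queries` maps
# query IDs to query records; both are defined elsewhere in the script
retriever.initialize()
retriever.setup_with_fabric(fabric)

# Distributed retrieval
results = retriever.retrieve_all(queries)

# Results are gathered on rank 0
if fabric.is_global_zero:
    print(f"Total queries retrieved: {len(results)}")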
Index backends
The sections below describe the available index backends and their associated retrievers.
Anserini
Anserini provides classical inverted-index retrieval (BM25, query-likelihood, etc.) via Lucene.
- XPM Config xpmir.index.anserini.Index(*, id, count, file_access, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)
Bases: AdhocIndex
Anserini-backed index
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
Path to the index
- XPM Config xpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)
Bases: Retriever
An Anserini-based retriever
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- index: xpmir.index.anserini.Index
The Anserini index
- model: xpmir.rankers.standard.Model
The model used to search. Only BM25 is supported so far.
- XPM Task xpmir.interfaces.anserini.IndexCollection(*, id, count, file_access, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, documents, threads)
An Anserini index (https://github.com/castorini/anserini)
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path (generated)
- documents: datamaestro_ir.data.Documents
The documents to index
- XPM Task xpmir.interfaces.anserini.SearchCollection(*, index, topics, model)
Bases: Task
- index: xpmir.index.anserini.Index
- topics: datamaestro_ir.data.Topics
- path: path (generated)
FAISS
FAISS provides approximate nearest-neighbour search for dense vector retrieval.
- XPM Config xpmir.index.faiss.FaissIndex(*, normalize, documents)
Bases: Config
FAISS Index
- faiss_index: path (generated)
Path to the file containing the index
- documents: datamaestro_ir.data.DocumentStore
The set of documents
- XPM Task xpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, batcher, hooks, indexspec, sampler)
Bases: FaissIndex, Task
Constructs a FAISS index backed up by an index
During execution, InitializationHooks are used (pre/post)
- faiss_index: path (generated)
Path to the file containing the index
- documents: datamaestro_ir.data.DocumentStore
The set of documents
- encoder: xpmir.text.encoders.TextEncoder
Encoder for document texts
- batcher: xpm_torch.batchers.Batcher = xpm_torch.batchers.Batcher()
The way to prepare batches of documents
- hooks: List[xpmir.context.Hook] = []
An optional list of hooks
- indexspec: str
The index type, as a factory string. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices and https://github.com/facebookresearch/faiss/wiki/The-index-factory for how to combine them with the index factory.
- sampler: xpmir.documents.samplers.DocumentSampler
Optional document sampler when training the index – by default, all the documents from the collection are used
- XPM Config xpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)
Bases: Retriever
Retriever based on FAISS
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.text.encoders.TextEncoder
The query encoder
- index: xpmir.index.faiss.FaissIndex
The faiss index
fast-plaid (ColBERT / PLAID)
Interface to fast-plaid, a
Rust-based implementation of PLAID / ColBERT late-interaction retrieval.
Per-document token vectors can be reconstructed from the compressed index
via get_document_tokens().
- XPM Config xpmir.index.plaid.PlaidIndex(*, documents, compress_only, index_path)
Bases: Config
A ColBERT / PLAID index backed by fast-plaid.
The index stores per-token document embeddings in fast-plaid’s compressed centroid + residual format. Per-document token vectors can be reconstructed (approximately) via get_document_tokens(), which delegates to fast-plaid’s get_embeddings method. The reconstruction quality is controlled by n_bits.
When compress_only is True, the index only contains the compressed vectors (centroids + quantised residuals) without the IVF search structure. This is cheaper to build and sufficient when only get_document_tokens() is needed. Attempting to search a compress-only index via PlaidRetriever will raise an error.
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index.
- index_path: path
Directory containing the fast-plaid index and side-car files.
- get_document_tokens(docid: int | str, device: str = '') → Tensor
Return the (approximate) per-token embeddings for a document.
The vectors are reconstructed from fast-plaid’s compressed centroid + residual storage using FastPlaid.get_embeddings. The reconstruction quality depends on n_bits.
- Parameters:
docid – The document identifier. Integers are interpreted as internal positions in the index (0..num_docs-1); strings are looked up in the external-to-internal map written at indexing time.
device – Device for the fast-plaid instance used to decompress ("" = auto).
- Returns:
A (num_tokens, dim) float tensor containing the reconstructed token embeddings.
- Raises:
KeyError – if the external identifier is unknown.
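To see why the reconstruction is approximate and why its quality depends on n_bits, consider the following toy centroid + residual codec. It uses a uniform quantiser over a fixed range, which is an assumption chosen for simplicity and not fast-plaid's actual quantisation scheme; `compress`, `decompress` and `max_abs` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(vec: Vector, centroids: List[Vector]) -> int:
    """Index of the centroid with the smallest squared distance to vec."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
    )


def compress(
    vec: Vector, centroids: List[Vector], n_bits: int, max_abs: float
) -> Tuple[int, List[int]]:
    """Store a centroid id plus residuals quantised to 2**n_bits levels."""
    cid = nearest_centroid(vec, centroids)
    levels = 2**n_bits - 1
    codes = []
    for a, b in zip(vec, centroids[cid]):
        # Residuals outside [-max_abs, max_abs] are clipped (toy assumption)
        clipped = max(-max_abs, min(max_abs, a - b))
        codes.append(round((clipped + max_abs) / (2 * max_abs) * levels))
    return cid, codes


def decompress(
    cid: int, codes: List[int], centroids: List[Vector], n_bits: int, max_abs: float
) -> List[float]:
    """Rebuild an approximate vector: centroid + de-quantised residual."""
    levels = 2**n_bits - 1
    return [
        b + (code / levels) * 2 * max_abs - max_abs
        for code, b in zip(codes, centroids[cid])
    ]


centroids = [[0.0, 0.0], [1.0, 1.0]]
token = [0.9, 1.25]


def max_error(n_bits: int) -> float:
    cid, codes = compress(token, centroids, n_bits, max_abs=0.5)
    restored = decompress(cid, codes, centroids, n_bits, max_abs=0.5)
    return max(abs(a - b) for a, b in zip(token, restored))


error_2_bits = max_error(2)
error_4_bits = max_error(4)
```

In this toy setting, more bits per residual dimension give a finer quantisation grid, so the 4-bit reconstruction is closer to the original token vector than the 2-bit one.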
- XPM Task xpmir.index.plaid.PlaidIndexBuilder(*, documents, encoder, batch_size, n_bits, kmeans_niters, n_samples_kmeans, compress_only, low_memory, device)
Bases: Task
Submit type: xpmir.index.plaid.PlaidIndex
Builds a fast-plaid index from a document collection.
The builder encodes every document using the given ColBERTEncoder, collects the valid (i.e. non-padding) token vectors, and feeds them to fast-plaid.
The fast-plaid index stores the embeddings in a compressed centroid + residual format, so no separate raw-token file is needed. Per-document token vectors can be reconstructed later via PlaidIndex.get_document_tokens().
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index.
- encoder: xpmir.text.encoders.TextEncoderBase
The ColBERT-style encoder used to produce per-token embeddings.
- kmeans_niters: int = 4
Number of K-means iterations performed by fast-plaid when clustering the centroids.
- n_samples_kmeans: int = 0
Number of token samples used to train the centroids (0 = fast-plaid default).
- compress_only: bool = False
When True, skip IVF construction. The resulting index supports PlaidIndex.get_document_tokens() but not search via PlaidRetriever. Requires fast-plaid support for compress_only (see lightonai/fast-plaid#41). Falls back to building the full index with a warning if unsupported.
- low_memory: bool = True
If the index fits in VRAM, set this to False for faster search. Otherwise, keep it True to avoid out-of-memory errors. See https://github.com/lightonai/fast-plaid#-search-speed-tip-low_memoryfalse
- fabric_config: xpm_torch.configuration.FabricConfiguration (generated)
Controls the device used for model encoding (separate from device, which controls the fast-plaid side).
- index_path: path (generated)
Output directory for the index and its side-car files.
- XPM Config xpmir.index.plaid.PlaidRetriever(*, store, encoder, index, topk, n_ivf_probe, n_full_scores, device)
Bases: Retriever
Retriever using a fast-plaid PLAID index.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.rankers.scorer.AbstractModuleScorer
The query encoder, typically the same encoder that was used to build the index.
- index: xpmir.index.plaid.PlaidIndex
The fast-plaid index to search.
Sparse retrieval
Learned sparse retrieval indexes (e.g. for SPLADE), backed by the impact-index Rust library.
- XPM Config xpmir.index.sparse.AbstractSparseRetrieverIndex(*, documents)
- documents: datamaestro_ir.data.DocumentStore
The indexed document collection
- XPM Task xpmir.index.sparse.AbstractSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs)
Bases: Task, ABC, Generic[InputType]
Builds an index from a sparse representation.
Assumes that document and query representations have the same dimension, and that the score is computed through an inner product.
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index
- encoder: xpmir.text.encoders.TextEncoderBase
The encoder
- batcher: xpm_torch.batchers.Batcher (generated)
Batcher used when computing representations
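The inner-product assumption means that a query only needs to visit the posting lists of its own non-zero dimensions. A minimal sketch of this idea follows, using an exhaustive term-at-a-time accumulation; the actual index library applies pruning strategies that are not shown here, and `build_inverted_index` and `search` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # term id -> non-zero weight
Postings = Dict[int, List[Tuple[str, float]]]


def build_inverted_index(docs: Dict[str, SparseVector]) -> Postings:
    """Map each term id to its posting list of (document id, impact)."""
    index: Postings = defaultdict(list)
    for doc_id, vector in docs.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def search(index: Postings, query: SparseVector, topk: int) -> List[Tuple[str, float]]:
    """Accumulate query_weight * impact over the postings of the query terms."""
    scores: Dict[str, float] = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += q_weight * impact
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topk]


docs = {"d1": {3: 1.0, 7: 0.5}, "d2": {7: 2.0}, "d3": {9: 1.0}}
index = build_inverted_index(docs)
hits = search(index, {7: 1.0, 3: 2.0}, topk=2)
```

Document d3 shares no dimension with the query, so it is never touched during the search.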
- XPM Config xpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, batchsize, in_memory)
Bases: Retriever, Generic[InputType]
Retriever for learned sparse models (e.g. SPLADE).
This retriever uses a TextEncoderBase to encode queries into sparse vectors, which are then used to search an AbstractSparseRetrieverIndex.
- Multi-GPU support:
When set up with a lightning.Fabric instance, retrieve_all() automatically shards the queries across GPUs and merges the results. It also adjusts the number of asynchronous search workers to prevent CPU oversubscription.
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- index: xpmir.index.sparse.AbstractSparseRetrieverIndex
The sparse retriever index
- encoder: xpmir.text.encoders.TextEncoderBase
Encodes InputType records to text representation output
- batcher: xpm_torch.batchers.Batcher (generated)
The way to prepare batches of queries (when using retrieve_all)
Impact library (Rust)
- XPM Config xpmir.index.sparse.SparseRetrieverIndex(*, documents, index_path)
Bases: AbstractSparseRetrieverIndex
- documents: datamaestro_ir.data.DocumentStore
The indexed document collection
- index_path: path
- XPM Task xpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings)
Bases: AbstractSparseRetrieverIndexBuilder[InputType]
Submit type: Any
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index
- encoder: xpmir.text.encoders.TextEncoderBase
The encoder
- batcher: xpm_torch.batchers.Batcher (generated)
Batcher used when computing representations
- ordered_index: bool
Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- index_path: path (generated)
- checkpoint_frequency: int = 0
Checkpoint frequency (allows recovery at the cost of writing some information to disk)
- fabric_config: xpm_torch.configuration.FabricConfiguration (generated)
Runtime configuration, managed by Fabric
Block-Max Pruning
Adapters for Faster Learned Sparse Retrieval with Block-Max Pruning.
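The general idea behind block-max pruning can be sketched as follows: documents are grouped into blocks, each block stores the maximum weight of every term it contains, and a query visits blocks by decreasing score upper bound until no remaining block can improve the current top-k. This is a simplified illustration that assumes non-negative weights; it is not the BMP library's implementation, and `build_blocks` and `bmp_search` are hypothetical names.

```python
import heapq
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # term id -> non-negative weight
Block = Tuple[SparseVector, List[Tuple[str, SparseVector]]]


def dot(query: SparseVector, doc: SparseVector) -> float:
    return sum(w * doc.get(term, 0.0) for term, w in query.items())


def build_blocks(docs: List[Tuple[str, SparseVector]], block_size: int) -> List[Block]:
    """Group documents into blocks and record each block's per-term maximum."""
    blocks = []
    for start in range(0, len(docs), block_size):
        members = docs[start : start + block_size]
        block_max: SparseVector = {}
        for _, vector in members:
            for term, weight in vector.items():
                block_max[term] = max(block_max.get(term, 0.0), weight)
        blocks.append((block_max, members))
    return blocks


def bmp_search(
    blocks: List[Block], query: SparseVector, topk: int
) -> Tuple[List[Tuple[float, str]], int]:
    """Visit blocks by decreasing upper bound; stop once no block can help."""
    bounded = sorted(
        ((dot(query, block_max), members) for block_max, members in blocks),
        key=lambda item: item[0],
        reverse=True,
    )
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top-k
    scored_docs = 0
    for upper_bound, members in bounded:
        if len(heap) == topk and upper_bound <= heap[0][0]:
            break  # every remaining block has an even lower bound
        for doc_id, vector in members:
            scored_docs += 1
            item = (dot(query, vector), doc_id)
            if len(heap) < topk:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True), scored_docs


docs = [
    ("d1", {1: 2.0}),
    ("d2", {1: 1.5, 2: 0.2}),
    ("d3", {2: 0.1}),
    ("d4", {3: 1.0}),
]
blocks = build_blocks(docs, block_size=2)
top, scored_docs = bmp_search(blocks, {1: 1.0, 2: 1.0}, topk=1)
```

In this example the second block has an upper bound of 0.1, which cannot beat the best score found in the first block, so only two of the four documents are scored.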
- XPM Config xpmir.index.sparse.BMPSparseRetrieverIndex(*, documents, index_path)
Bases: AbstractSparseRetrieverIndex
- documents: datamaestro_ir.data.DocumentStore
The indexed document collection
- index_path: path
The path of the BMP index
- XPM Task xpmir.index.sparse.BMPSparseRetrieverIndexBuilder(*, documents, encoder, batch_size, ordered_index, max_docs, in_memory, checkpoint_frequency, max_postings, block_size, compress_range)
Bases: SparseRetrieverIndexBuilder[InputType]
Submit type: Any
Indexes documents using a BMP index
- documents: datamaestro_ir.data.DocumentStore
Set of documents to index
- encoder: xpmir.text.encoders.TextEncoderBase
The encoder
- batcher: xpm_torch.batchers.Batcher (generated)
Batcher used when computing representations
- ordered_index: bool
Ordered index: if not ordered, use DAAT strategy (WAND), otherwise, use fast top-k strategies
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- index_path: path (generated)
- checkpoint_frequency: int = 0
Checkpoint frequency (allows recovery at the cost of writing some information to disk)
- fabric_config: xpm_torch.configuration.FabricConfiguration (generated)
Runtime configuration, managed by Fabric
- bmp_index_path: path (generated)
The final index path
- XPM Config xpmir.index.sparse.BMPSparseRetriever(*, store, index, encoder, topk, batchsize, in_memory, alpha, beta)
Bases: SparseRetriever
A Block-Max Pruning retriever
- store: datamaestro_ir.data.DocumentStore
Give the document store associated with this retriever
- index: xpmir.index.sparse.AbstractSparseRetrieverIndex
The sparse retriever index
- encoder: xpmir.text.encoders.TextEncoderBase
Encodes InputType records to text representation output
- batcher: xpm_torch.batchers.Batcher (generated)
The way to prepare batches of queries (when using retrieve_all)