Retrieval
This page describes the different configurations/tasks needed for retrieval, i.e. searching for a subset of \(k\) documents given a query.
Base class shows the main class used for retrieval,
Standard IR models describes the configurations for standard IR models like BM25,
Multi-stage retrievers describes the configurations handling multi-stage retrieval (e.g. two-stage retriever)
Factories describes utility classes and decorators that can be used to build retrievers that depend on a dataset.
Finally, retrieval interfaces to other libraries (Anserini, FAISS) are described.
Base class
- class xpmir.rankers.ScoredDocument(document: Document, score: float)[source]
Bases:
object
A data structure that associates a score with a document
- XPM Configxpmir.rankers.Retriever(*, store)[source]
Bases:
Config
A retriever is a model to return top-scored documents given a query
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retrieve(query: str) List[ScoredDocument] [source]
Retrieves documents, returning a list sorted by decreasing score
If content is true, includes the full document text
- retrieve_all(queries: Dict[str, str]) Dict[str, List[ScoredDocument]] [source]
Retrieves documents for a set of queries
By default, iterates using self.retrieve, but subclasses can override this for optimization
- Parameters:
queries – A dictionary where the key is the ID of the query, and the value is the text
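The default retrieve_all behaviour can be sketched in plain Python (this is an illustration of the documented contract, not xpmir's actual implementation; the `retrieve` callable and the `(doc_id, score)` representation are assumptions standing in for `Retriever.retrieve` and `ScoredDocument`):

```python
# Illustrative sketch: retrieve_all maps each query ID to the results of
# retrieve() on the query text, mirroring the documented default behaviour.
from typing import Callable, Dict, List, Tuple

# A scored document is represented here as a (doc_id, score) tuple.
ScoredDoc = Tuple[str, float]

def retrieve_all(
    retrieve: Callable[[str], List[ScoredDoc]],
    queries: Dict[str, str],
) -> Dict[str, List[ScoredDoc]]:
    """Map each query ID to the documents retrieved for its text."""
    return {qid: retrieve(text) for qid, text in queries.items()}

# Usage with a toy retrieve function
toy = lambda q: [("d1", 2.0), ("d2", 1.0)] if "cat" in q else []
results = retrieve_all(toy, {"q1": "cat videos", "q2": "dogs"})
```

Keeping the loop in one place is what "leaves room for optimization": a subclass can batch all query texts at once instead of calling retrieve per query.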
Standard IR models
Standard IR models are definitions that can be used by a specific retriever instance,
e.g. xpmir.interfaces.anserini.AnseriniRetriever
Multi-stage retrievers
In a re-ranking setting, one can use a two-stage retriever to perform retrieval, by using a fully-fledged retriever first, and then re-ranking the results.
- XPM Configxpmir.rankers.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
Retriever
Abstract class for all two stage retrievers (i.e. scorers and duo-scorers)
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
- XPM Configxpmir.rankers.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
AbstractTwoStageRetriever
Uses one retriever to select the top-K documents, which are then re-ranked by a scorer
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
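The two-stage logic can be sketched as follows (a minimal illustration of the pipeline described above, not xpmir's code; `first_stage` and `scorer` are hypothetical plain functions standing in for the `retriever` and `scorer` parameters):

```python
# Sketch of two-stage retrieval: retrieve candidates, re-score them,
# and return them sorted by the new scores.
from typing import Callable, List, Tuple

ScoredDoc = Tuple[str, float]

def two_stage_retrieve(
    first_stage: Callable[[str], List[ScoredDoc]],
    scorer: Callable[[str, str], float],  # (query, doc_id) -> score
    query: str,
    top_k: int,
) -> List[ScoredDoc]:
    # 1) Select candidates with the first-stage retriever
    candidates = first_stage(query)[:top_k]
    # 2) Re-score each candidate with the (typically neural) scorer
    rescored = [(doc_id, scorer(query, doc_id)) for doc_id, _ in candidates]
    # 3) Return candidates sorted by decreasing re-ranked score
    return sorted(rescored, key=lambda ds: ds[1], reverse=True)
```

In the actual class, batching (batchsize/batcher) and device placement control how step 2 is executed.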
Duo-retrievers
Duo-retrievers only predict whether a document is “more relevant” than another
- XPM Configxpmir.rankers.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
AbstractTwoStageRetriever
The two stage retriever for pairwise scorers.
For a pairwise scorer, we need to aggregate the pairwise scores in some way.
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
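One common way to aggregate pairwise scores is to sum, for each document, its scores against every other candidate (a hedged sketch of the idea, not necessarily the aggregation xpmir uses; `duo_score` is a hypothetical stand-in for the pairwise scorer):

```python
# Sketch of pairwise (duo) score aggregation: a document's final score is
# the sum of its "more relevant than" scores over all other candidates.
from itertools import permutations
from typing import Callable, Dict, List

def aggregate_duo_scores(
    duo_score: Callable[[str, str, str], float],  # (query, doc_a, doc_b)
    query: str,
    doc_ids: List[str],
) -> Dict[str, float]:
    scores = {d: 0.0 for d in doc_ids}
    # Score every ordered pair of distinct candidates
    for a, b in permutations(doc_ids, 2):
        scores[a] += duo_score(query, a, b)
    return scores
```

Note that this requires O(k²) scorer calls over the k candidates, which is why duo scorers are usually applied to a small top-k.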
- XPM Configxpmir.rankers.DuoLearnableScorer[source]
Bases:
LearnableScorer
Base class for models that can score a triplet (query, document 1, document 2)
Misc
- XPM Configxpmir.rankers.full.FullRetriever(*, store, documents)[source]
Bases:
Retriever
Retrieves all the documents of the collection
This can be used to build a small validation set on a subset of the collection - in that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever.
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_text.data.ir.Documents
- XPM Configxpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher, device)[source]
Bases:
Retriever
Scores all the documents from a collection
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_text.data.ir.Documents
The set of documents to consider
- scorer: xpmir.neural.DualRepresentationScorer
The scorer (a dual representation scorer)
- batchsize: int = 0
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
- device: xpmir.learning.devices.Device
- XPM Configxpmir.rankers.RetrieverHydrator(*, store, retriever)[source]
Bases:
Retriever
Hydrate retrieved results with document text
- store: datamaestro_text.data.ir.DocumentStore
The store for document texts
- retriever: xpmir.rankers.Retriever
The retriever to hydrate
- XPM Configxpmir.rankers.mergers.SumRetriever(*, store, retrievers, weights)[source]
Bases:
Retriever
Combines the scores of various retrievers
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retrievers: List[xpmir.rankers.Retriever]
The retrievers to combine
- weights: List[int]
The weights of the retrievers
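The score combination performed by SumRetriever can be illustrated as a weighted sum per document over the result lists (an assumption-level sketch of the documented behaviour, using a `(doc_id, score)` tuple in place of ScoredDocument):

```python
# Sketch of weighted score merging: each document's combined score is the
# weighted sum of its scores across the individual retrievers' results.
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredDoc = Tuple[str, float]

def merge_weighted(
    results: List[List[ScoredDoc]],  # one result list per retriever
    weights: List[int],              # one weight per retriever
) -> List[ScoredDoc]:
    combined: Dict[str, float] = defaultdict(float)
    for weight, scored in zip(weights, results):
        for doc_id, score in scored:
            combined[doc_id] += weight * score
    # Return the merged list sorted by decreasing combined score
    return sorted(combined.items(), key=lambda ds: ds[1], reverse=True)
```

Documents retrieved by only some of the retrievers simply contribute the terms that are present.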
Collection dependent
Anserini
- XPM Configxpmir.index.anserini.Index(*, id, count, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)[source]
Bases:
AdhocIndex
Anserini-backed index
- id: str
The unique dataset ID
- count: int
Number of documents
- path: Path
Path to the index
- storePositions: bool = False
Store term positions
- storeDocvectors: bool = False
Store document term vectors
- storeRaw: bool = False
Store raw document
- storeContents: bool = False
Store processed documents (e.g. without HTML tags)
- stemmer: str = porter
The stemmer to use
- XPM Configxpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)[source]
Bases:
Retriever
An Anserini-based retriever
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- index: xpmir.index.anserini.Index
The Anserini index
- model: xpmir.rankers.standard.Model
The model used to search. Only supports BM25 so far.
- k: int = 1500
Number of results to retrieve
- XPM Taskxpmir.interfaces.anserini.IndexCollection(*, id, count, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, threads, documents, thread)[source]
Bases:
Index
,Task
An [Anserini](https://github.com/castorini/anserini) index
- id: str
Use an empty ID since the identifier is determined by the documents
- count: int
Number of documents
- path: Pathgenerated
- storePositions: bool = False
Store term positions
- storeDocvectors: bool = False
Store document term vectors
- storeRaw: bool = False
Store raw document
- storeContents: bool = False
Store processed documents (e.g. without HTML tags)
- stemmer: str = porter
The stemmer to use
- threads: int = 8
- documents: datamaestro_text.data.ir.Documents
The documents to index
- thread: int = 8
Number of threads when indexing
- XPM Taskxpmir.interfaces.anserini.SearchCollection(*, model, topics, index)[source]
Bases:
Task
- path: Pathgenerated
- topics: datamaestro_text.data.ir.Topics
- index: xpmir.index.anserini.Index
FAISS
- XPM Configxpmir.index.faiss.FaissIndex(*, normalize, documents)[source]
Bases:
Config
FAISS Index
- normalize: bool
Whether vectors should be normalized (L2)
- faiss_index: Pathgenerated
Path to the file containing the index
- documents: datamaestro_text.data.ir.DocumentStore
The set of documents
- XPM Taskxpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, device, batcher, hooks, indexspec, sampler)[source]
Bases:
FaissIndex
,Task
Constructs a FAISS index backed by an index
During execution, InitializationHooks are used (pre/post)
- normalize: bool
Whether vectors should be normalized (L2)
- faiss_index: Pathgenerated
Path to the file containing the index
- documents: datamaestro_text.data.ir.DocumentStore
The set of documents
- encoder: xpmir.text.encoders.TextEncoder
Encoder for document texts
- batchsize: int = 1
The batch size used when computing representations of documents
- device: xpmir.learning.devices.Device = xpmir.learning.devices.Device()
The device used by the encoder
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
The way to prepare batches of documents
- hooks: List[xpmir.context.Hook] = []
An optional list of hooks
- indexspec: str
The index type, given as a factory string. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices, and https://github.com/facebookresearch/faiss/wiki/The-index-factory for index factory strings
- sampler: xpmir.documents.samplers.DocumentSampler
Optional document sampler when training the index – by default, all the documents from the collection are used
- XPM Configxpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)[source]
Bases:
Retriever
Retriever based on Faiss
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.text.encoders.TextEncoder
The query encoder
- index: xpmir.index.faiss.FaissIndex
The faiss index
- topk: int
The number of documents to be retrieved
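What a dense FAISS-backed retriever computes can be sketched without FAISS itself: encode the query, optionally L2-normalize (the `normalize` option above), take inner products with the precomputed document vectors, and keep the top-k. FAISS replaces the exhaustive loop below with optimized, possibly approximate, index structures:

```python
# Pure-Python sketch of dense retrieval by inner product over (optionally
# L2-normalized) vectors; FAISS performs the same computation efficiently.
import math
from typing import Dict, List, Tuple

def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dense_topk(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    topk: int,
    normalize: bool = True,
) -> List[Tuple[str, float]]:
    q = l2_normalize(query_vec) if normalize else query_vec
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec) if normalize else vec
        # Inner product between query and document representations
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda ds: ds[1], reverse=True)[:topk]
```

With normalization enabled, the inner product equals the cosine similarity, which is the usual reason for the normalize flag.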
Sparse
- XPM Configxpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, batcher, batchsize, in_memory)[source]
Bases:
Retriever
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.text.encoders.TextEncoder
- topk: int
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
The way to prepare batches of queries (when using retrieve_all)
- batchsize: int
Size of batches (when using retrieve_all)
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- XPM Configxpmir.index.sparse.SparseRetrieverIndex(*, index_path, documents)[source]
Bases:
Config
- index_path: Path
- documents: datamaestro_text.data.ir.DocumentStore
- XPM Taskxpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batcher, batch_size, ordered_index, device, max_postings, in_memory, max_docs)[source]
Bases:
Task
Submit type:
Any
Builds an index from a sparse representation
Assumes that documents and queries have the same dimension, and that the score is computed through an inner product
- documents: datamaestro_text.data.ir.DocumentStore
Set of documents to index
- encoder: xpmir.text.encoders.TextEncoder
The encoder
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
Batcher used when computing representations
- batch_size: int
Size of batches
- ordered_index: bool
Whether the index is ordered: if not ordered, uses a DAAT strategy (WAND); otherwise, uses fast top-k strategies
- device: xpmir.learning.devices.Device = xpmir.learning.devices.Device()
- max_postings: int = 16384
Maximum number of postings (per term) before flushing to disk
- index_path: Pathgenerated
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- version: int = 3constant
Version 3 of the index
- max_docs: int = 0
Maximum number of indexed documents
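The scoring model assumed by the sparse index (same dimension for documents and queries, inner-product score) can be sketched with a toy inverted index: each term maps to a posting list of `(doc_id, weight)` pairs, and a document accumulates `query_weight * doc_weight` for every query term it contains. This is an illustration of the inner-product contract stated above, not xpmir's on-disk format:

```python
# Sketch of sparse inner-product scoring over per-term posting lists.
from collections import defaultdict
from typing import Dict, List, Tuple

# Inverted index: term id -> postings (doc_id, weight)
Index = Dict[int, List[Tuple[str, float]]]

def score_sparse(query: Dict[int, float], index: Index) -> Dict[str, float]:
    scores: Dict[str, float] = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            # Each matching term contributes one inner-product term
            scores[doc_id] += q_weight * d_weight
    return dict(scores)
```

Strategies such as WAND (the DAAT option above) compute the same scores while skipping postings that cannot reach the current top-k threshold.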