Retrieval

This page describes the configurations and tasks needed for retrieval, i.e. searching for a subset of \(k\) documents given a query.

  • Base class shows the main class used for retrieval,

  • Standard IR models describes the configurations for standard IR models like BM25,

  • Multi-stage retrievers describes the configurations for multi-stage retrieval (e.g. two-stage retrievers),

  • Factories describes utility classes and decorators that can be used to build retrievers that depend on a dataset.

Finally, retrieval interfaces to other libraries are given for Anserini and FAISS.

Base class

class xpmir.rankers.ScoredDocument(document: Record, score: float)[source]

Bases: object

A data structure that associates a score with a document

XPM Config xpmir.rankers.Retriever(*, store)[source]

Bases: Config, ABC

Submit type: xpmir.rankers.Retriever

A retriever is a model to return top-scored documents given a query

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

collection()[source]

Returns the document collection object

abstract retrieve(record: Record) → List[ScoredDocument][source]

Retrieves documents, returning a list sorted by decreasing score

If content is true, the full text of the documents is included.

retrieve_all(queries: Dict[str, Record]) → Dict[str, List[ScoredDocument]][source]

Retrieves documents for a set of queries

By default, this iterates using self.retrieve, but subclasses can override it for optimization

Parameters:

queries – A dictionary where the key is the ID of the query, and the value is the query record
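
As a usage sketch (hypothetical names; this assumes a concrete Retriever instance and a query record built elsewhere):

    from xpmir.rankers import Retriever

    def print_top_documents(retriever: Retriever, query_record) -> None:
        # retrieve() returns a list of ScoredDocument, sorted by decreasing score
        for scored in retriever.retrieve(query_record):
            # a ScoredDocument pairs a document record with its score
            print(scored.score, scored.document)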

Standard IR models

Standard IR models are definitions that can be used by specific retriever instances, e.g. xpmir.interfaces.anserini.AnseriniRetriever

XPM Config xpmir.rankers.standard.Model[source]

Bases: Config

Submit type: xpmir.rankers.standard.Model

Base class for standard IR models

XPM Config xpmir.rankers.standard.BM25(*, k1, b)[source]

Bases: Model

Submit type: xpmir.rankers.standard.BM25

BM25 model definition

k1: float = 0.9
b: float = 0.4
XPM Config xpmir.rankers.standard.QLDirichlet(*, mu)[source]

Bases: Model

Submit type: xpmir.rankers.standard.QLDirichlet

Query likelihood (Dirichlet smoothing) model definition

mu: float = 1000
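
As an illustration, these definitions are plain configurations holding hyper-parameters; they must be paired with a concrete retriever (e.g. xpmir.interfaces.anserini.AnseriniRetriever) to actually search:

    from xpmir.rankers.standard import BM25, QLDirichlet

    bm25 = BM25(k1=0.9, b=0.4)  # the defaults, shown explicitly
    qld = QLDirichlet(mu=1000)  # Dirichlet-smoothed query likelihood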

Multi-stage retrievers

In a re-ranking setting, one can use a two-stage retriever to perform retrieval: a fully fledged retriever is used first, and its results are then re-ranked.

XPM Config xpmir.rankers.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]

Bases: Retriever

Submit type: xpmir.rankers.AbstractTwoStageRetriever

Abstract class for all two-stage retrievers (i.e. those based on scorers and duo-scorers)

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.Retriever

The base retriever

scorer: xpmir.rankers.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

How to provide batches of documents

device: xpmir.learning.devices.Device

Device on which the model is run

XPM Config xpmir.rankers.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]

Bases: AbstractTwoStageRetriever

Submit type: xpmir.rankers.TwoStageRetriever

Uses a retriever to select the top-k documents, which are then re-ranked given a scorer

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.Retriever

The base retriever

scorer: xpmir.rankers.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

How to provide batches of documents

device: xpmir.learning.devices.Device

Device on which the model is run
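
A minimal construction sketch, assuming a first-stage retriever and a learned scorer configured elsewhere (first_stage and neural_scorer are hypothetical names):

    from xpmir.rankers import TwoStageRetriever

    reranker = TwoStageRetriever(
        retriever=first_stage,  # hypothetical base Retriever (e.g. BM25 over Anserini)
        scorer=neural_scorer,   # hypothetical Scorer used to re-rank the candidates
        top_k=100,              # number of documents returned after re-ranking
        batchsize=64,           # batch size for the re-ranker
    )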

Duo-retrievers

Duo-retrievers only predict whether a document is “more relevant” than another

XPM Config xpmir.rankers.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]

Bases: AbstractTwoStageRetriever

Submit type: xpmir.rankers.DuoTwoStageRetriever

The two-stage retriever for pairwise scorers.

For a pairwise scorer, the pairwise scores need to be aggregated in some way.

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

retriever: xpmir.rankers.Retriever

The base retriever

scorer: xpmir.rankers.Scorer

The scorer used to re-rank the documents

top_k: int

The number of returned documents (if None, returns all the documents)

batchsize: int = 0

The batch size for the re-ranker

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

How to provide batches of documents

device: xpmir.learning.devices.Device

Device on which the model is run

XPM Config xpmir.rankers.DuoLearnableScorer[source]

Bases: LearnableScorer

Submit type: xpmir.rankers.DuoLearnableScorer

Base class for models that can score a triplet (query, document 1, document 2)

Misc

XPM Config xpmir.rankers.full.FullRetriever(*, store, documents)[source]

Bases: Retriever

Submit type: xpmir.rankers.full.FullRetriever

Retrieves all the documents of the collection

This can be used to build a small validation set on a subset of the collection; in that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever (see the sketch below).

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

documents: datamaestro_text.data.ir.Documents
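
A sketch of the validation setup described above, assuming a document subset and a scorer configured elsewhere (validation_documents and neural_scorer are hypothetical names):

    from xpmir.rankers import TwoStageRetriever
    from xpmir.rankers.full import FullRetriever

    # Enumerate the whole (small) validation subset...
    full = FullRetriever(documents=validation_documents)

    # ...and let the scorer re-rank it
    validation_retriever = TwoStageRetriever(
        retriever=full, scorer=neural_scorer, top_k=100
    )
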
XPM Config xpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher, device)[source]

Bases: Retriever

Submit type: xpmir.rankers.full.FullRetrieverRescorer

Scores all the documents from a collection

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

documents: datamaestro_text.data.ir.Documents

The set of documents to consider

scorer: xpmir.neural.DualRepresentationScorer

The scorer (a dual representation scorer)

batchsize: int = 0
batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()
device: xpmir.learning.devices.Device
XPM Config xpmir.rankers.RetrieverHydrator(*, store, retriever)[source]

Bases: Retriever

Submit type: xpmir.rankers.RetrieverHydrator

Hydrate retrieved results with document text

store: datamaestro_text.data.ir.DocumentStore

The store for document texts

retriever: xpmir.rankers.Retriever

The retriever to hydrate
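
For instance, a first-stage retriever that only returns document identifiers can be wrapped so that downstream components see the full text (first_stage and document_store are hypothetical names):

    from xpmir.rankers import RetrieverHydrator

    # Results from first_stage are completed with the text from document_store
    hydrated = RetrieverHydrator(retriever=first_stage, store=document_store)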

XPM Config xpmir.rankers.mergers.SumRetriever(*, store, retrievers, weights)[source]

Bases: Retriever

Submit type: xpmir.rankers.mergers.SumRetriever

Combines the scores of various retrievers

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

retrievers: List[xpmir.rankers.Retriever]

The retrievers whose scores are combined

weights: List[int]

The weights of the retrievers
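
A combination sketch, assuming two retrievers configured elsewhere (bm25_retriever and dense_retriever are hypothetical names); scores are combined using the given weights:

    from xpmir.rankers.mergers import SumRetriever

    hybrid = SumRetriever(
        retrievers=[bm25_retriever, dense_retriever],  # hypothetical retrievers
        weights=[1, 2],  # integer weights, one per retriever
    )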

Collection dependent

Anserini

XPM Config xpmir.index.anserini.Index(*, id, count, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)[source]

Bases: AdhocIndex

Submit type: xpmir.index.anserini.Index

Anserini-backed index

id: str

The unique dataset ID

count: int

Number of documents

path: Path

Path to the index

storePositions: bool = False

Store term positions

storeDocvectors: bool = False

Store document term vectors

storeRaw: bool = False

Store raw document

storeContents: bool = False

Store processed documents (e.g. without HTML tags)

stemmer: str = porter

The stemmer to use

XPM Config xpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)[source]

Bases: Retriever

Submit type: xpmir.interfaces.anserini.AnseriniRetriever

An Anserini-based retriever

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

index: xpmir.index.anserini.Index

The Anserini index

model: xpmir.rankers.standard.Model

The model used to search. Only BM25 is supported so far.

k: int = 1500

Number of results to retrieve

XPM Task xpmir.interfaces.anserini.IndexCollection(*, id, count, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, threads, documents, thread)[source]

Bases: Index, Task

Submit type: xpmir.interfaces.anserini.IndexCollection

An [Anserini](https://github.com/castorini/anserini) index

id: str

Use an empty ID since the identifier is determined by the documents

count: int

Number of documents

path: Path (generated)
storePositions: bool = False

Store term positions

storeDocvectors: bool = False

Store document term vectors

storeRaw: bool = False

Store raw document

storeContents: bool = False

Store processed documents (e.g. without HTML tags)

stemmer: str = porter

The stemmer to use

threads: int = 8
documents: datamaestro_text.data.ir.Documents

The documents to index

thread: int = 8

Number of threads when indexing

XPM Task xpmir.interfaces.anserini.SearchCollection(*, model, topics, index)[source]

Bases: Task

Submit type: xpmir.interfaces.anserini.SearchCollection

path: Path (generated)
model: xpmir.rankers.standard.Model
topics: datamaestro_text.data.ir.Topics
index: xpmir.index.anserini.Index
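
Putting the Anserini pieces together: a sketch of indexing a collection and searching it with BM25, assuming an active experimaestro experiment and a documents configuration defined elsewhere (hypothetical names):

    from xpmir.interfaces.anserini import AnseriniRetriever, IndexCollection
    from xpmir.rankers.standard import BM25

    # IndexCollection is a Task: submitting it builds the index
    index = IndexCollection(documents=documents, storeContents=True).submit()

    # Search the resulting index with BM25 (the only model supported so far)
    retriever = AnseriniRetriever(index=index, model=BM25(), k=1500)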

FAISS

XPM Config xpmir.index.faiss.FaissIndex(*, normalize, documents)[source]

Bases: Config

Submit type: xpmir.index.faiss.FaissIndex

FAISS Index

normalize: bool

Whether vectors should be normalized (L2)

faiss_index: Path (generated)

Path to the file containing the index

documents: datamaestro_text.data.ir.DocumentStore

The set of documents

XPM Task xpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, device, batcher, hooks, indexspec, sampler)[source]

Bases: FaissIndex, Task

Submit type: xpmir.index.faiss.IndexBackedFaiss

Constructs a FAISS index backed by a document store

During execution, InitializationHooks are used (pre/post)

normalize: bool

Whether vectors should be normalized (L2)

faiss_index: Path (generated)

Path to the file containing the index

documents: datamaestro_text.data.ir.DocumentStore

The set of documents

encoder: xpmir.text.encoders.TextEncoder

Encoder for document texts

batchsize: int = 1

The batch size used when computing representations of documents

device: xpmir.learning.devices.Device = xpmir.learning.devices.Device.XPMValue()

The device used by the encoder

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

The way to prepare batches of documents

hooks: List[xpmir.context.Hook] = []

An optional list of hooks

indexspec: str

The index type as a factory string. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices, and https://github.com/facebookresearch/faiss/wiki/The-index-factory for the index factory syntax.

sampler: xpmir.documents.samplers.DocumentSampler

Optional document sampler used when training the index; by default, all the documents from the collection are used

XPM Config xpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)[source]

Bases: Retriever

Submit type: xpmir.index.faiss.FaissRetriever

Retriever based on Faiss

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

encoder: xpmir.text.encoders.TextEncoder

The query encoder

index: xpmir.index.faiss.FaissIndex

The faiss index

topk: int

The number of documents to be retrieved
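
A dense retrieval sketch, assuming document and query encoders configured elsewhere and an active experimaestro experiment (doc_encoder, query_encoder and document_store are hypothetical names); "Flat" is the simplest FAISS factory string (exact inner-product search):

    from xpmir.index.faiss import FaissRetriever, IndexBackedFaiss

    # Build the index (a Task, hence submitted)
    index = IndexBackedFaiss(
        documents=document_store,  # hypothetical document store
        encoder=doc_encoder,       # hypothetical document encoder
        indexspec="Flat",          # see the FAISS index-factory documentation
        normalize=False,
    ).submit()

    # Retrieve with the query encoder against the built index
    retriever = FaissRetriever(encoder=query_encoder, index=index, topk=100)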

Sparse

XPM Config xpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, device, batcher, batchsize, in_memory)[source]

Bases: Retriever, Generic[InputType]

Submit type: xpmir.index.sparse.SparseRetriever

store: datamaestro_text.data.ir.DocumentStore

Give the document store associated with this retriever

index: xpmir.index.sparse.SparseRetrieverIndex
encoder: xpmir.text.encoders.TextEncoderBase[InputType, torch.Tensor]
topk: int
device: xpmir.learning.devices.Device = xpmir.learning.devices.Device.XPMValue()

The device for building the index

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

The way to prepare batches of queries (when using retrieve_all)

batchsize: int

Size of batches (when using retrieve_all)

in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

XPM Config xpmir.index.sparse.SparseRetrieverIndex(*, index_path, documents)[source]

Bases: Config

Submit type: xpmir.index.sparse.SparseRetrieverIndex

index_path: Path
documents: datamaestro_text.data.ir.DocumentStore
XPM Task xpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batcher, batch_size, ordered_index, device, max_postings, in_memory, max_docs)[source]

Bases: Task, Generic[InputType]

Submit type: Any

Builds an index from a sparse representation

Assumes that documents and queries have the same dimension, and that the score is computed through an inner product

documents: datamaestro_text.data.ir.DocumentStore

Set of documents to index

encoder: xpmir.text.encoders.TextEncoderBase[InputType, xpmir.text.encoders.TextsRepresentationOutput]

The encoder

batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher.XPMValue()

Batcher used when computing representations

batch_size: int

Size of batches

ordered_index: bool

Whether the index is ordered: if not ordered, a DAAT strategy (WAND) is used; otherwise, fast top-k strategies are used

device: xpmir.learning.devices.Device = xpmir.learning.devices.Device.XPMValue()

The device for building the index

max_postings: int = 16384

Maximum number of postings (per term) before flushing to disk

index_path: Path (generated)
in_memory: bool = False

Whether the index should be fully loaded in memory (otherwise, uses virtual memory)

version: int = 3 (constant)

Version 3 of the index

max_docs: int = 0

Maximum number of indexed documents
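
A sparse retrieval sketch, assuming learned sparse encoders configured elsewhere and an active experimaestro experiment (doc_encoder, query_encoder and document_store are hypothetical names; the submitted builder output is used as the retriever's index):

    from xpmir.index.sparse import SparseRetriever, SparseRetrieverIndexBuilder

    # Build the impact index from document representations (a Task)
    index = SparseRetrieverIndexBuilder(
        documents=document_store,  # hypothetical document store
        encoder=doc_encoder,       # hypothetical sparse document encoder
        batch_size=512,
        ordered_index=False,       # DAAT (WAND) strategy
    ).submit()

    # Search it with the query encoder
    retriever = SparseRetriever(
        index=index, encoder=query_encoder, topk=1000, batchsize=512
    )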