Retrieval
This page describes the different configurations/tasks needed for retrieval, i.e. searching for a subset of \(k\) documents given a query.
Base class shows the main class used for retrieval,
Standard IR models describes the configurations for standard IR models like BM25,
Multi-stage retrievers describes the configurations handling multi-stage retrieval (e.g. two-stage retriever)
Factories describes utility classes and decorators that can be used to build retrievers that depend on a dataset.
Finally, retrieval interfaces to other libraries (Anserini, FAISS) are described.
Base class
- class xpmir.rankers.ScoredDocument(document: Document, score: float)[source]
Bases:
object
A data structure that associates a score with a document
- XPM Configxpmir.rankers.Retriever(*, store)[source]
Bases:
Config
A retriever is a model to return top-scored documents given a query
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retrieve(query: str) List[ScoredDocument] [source]
Retrieves documents, returning a list sorted by decreasing score
If content is true, includes the full document text
- retrieve_all(queries: Dict[str, str]) Dict[str, List[ScoredDocument]] [source]
Retrieves documents for a set of queries
By default, iterates using self.retrieve, but subclasses can override this for optimization
- Parameters:
queries – A dictionary where the key is the ID of the query, and the value is the text
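The default retrieve_all behaviour can be sketched in plain Python (this is an illustration of the documented contract, not xpmir's actual implementation; the `retrieve` callable and the `(doc_id, score)` representation are assumptions standing in for `Retriever.retrieve` and `ScoredDocument`):

```python
# Illustrative sketch: retrieve_all maps each query ID to the results of
# retrieve() on the query text, mirroring the documented default behaviour.
from typing import Callable, Dict, List, Tuple

# A scored document is represented here as a (doc_id, score) tuple.
ScoredDoc = Tuple[str, float]

def retrieve_all(
    retrieve: Callable[[str], List[ScoredDoc]],
    queries: Dict[str, str],
) -> Dict[str, List[ScoredDoc]]:
    """Map each query ID to the documents retrieved for its text."""
    return {qid: retrieve(text) for qid, text in queries.items()}

# Usage with a toy retrieve function
toy = lambda q: [("d1", 2.0), ("d2", 1.0)] if "cat" in q else []
results = retrieve_all(toy, {"q1": "cat videos", "q2": "dogs"})
```

Keeping the loop in one place is what "leaves room for optimization": a subclass can batch all query texts at once instead of calling retrieve per query.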
Standard IR models
Standard IR models are definitions that can be used by a specific retriever instance,
e.g. xpmir.interfaces.anserini.AnseriniRetriever
Multi-stage retrievers
In a re-ranking setting, one can use a two-stage retriever to perform retrieval, by using a fully-fledged retriever first, and then re-ranking the results.
- XPM Configxpmir.rankers.AbstractTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
Retriever
Abstract class for all two stage retrievers (i.e. scorers and duo-scorers)
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
- XPM Configxpmir.rankers.TwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
AbstractTwoStageRetriever
Uses one retriever to select the top-K documents, which are then re-ranked by a scorer
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
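The two-stage logic can be sketched as follows (a minimal illustration of the pipeline described above, not xpmir's code; `first_stage` and `scorer` are hypothetical plain functions standing in for the `retriever` and `scorer` parameters):

```python
# Sketch of two-stage retrieval: retrieve candidates, re-score them,
# and return them sorted by the new scores.
from typing import Callable, List, Tuple

ScoredDoc = Tuple[str, float]

def two_stage_retrieve(
    first_stage: Callable[[str], List[ScoredDoc]],
    scorer: Callable[[str, str], float],  # (query, doc_id) -> score
    query: str,
    top_k: int,
) -> List[ScoredDoc]:
    # 1) Select candidates with the first-stage retriever
    candidates = first_stage(query)[:top_k]
    # 2) Re-score each candidate with the (typically neural) scorer
    rescored = [(doc_id, scorer(query, doc_id)) for doc_id, _ in candidates]
    # 3) Return candidates sorted by decreasing re-ranked score
    return sorted(rescored, key=lambda ds: ds[1], reverse=True)
```

In the actual class, batching (batchsize/batcher) and device placement control how step 2 is executed.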
Duo-retrievers
Duo-retrievers only predict whether a document is “more relevant” than another
- XPM Configxpmir.rankers.DuoTwoStageRetriever(*, store, retriever, scorer, top_k, batchsize, batcher, device)[source]
Bases:
AbstractTwoStageRetriever
The two stage retriever for pairwise scorers.
For a pairwise scorer, we need to aggregate the pairwise scores in some way.
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retriever: xpmir.rankers.Retriever
The base retriever
- scorer: xpmir.rankers.Scorer
The scorer used to re-rank the documents
- top_k: int
The number of returned documents (if None, returns all the documents)
- batchsize: int = 0
The batch size for the re-ranker
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
How to provide batches of documents
- device: xpmir.learning.devices.Device
Device on which the model is run
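One common way to aggregate pairwise scores is to sum, for each document, its scores against every other candidate (a hedged sketch of the idea, not necessarily the aggregation xpmir uses; `duo_score` is a hypothetical stand-in for the pairwise scorer):

```python
# Sketch of pairwise (duo) score aggregation: a document's final score is
# the sum of its "more relevant than" scores over all other candidates.
from itertools import permutations
from typing import Callable, Dict, List

def aggregate_duo_scores(
    duo_score: Callable[[str, str, str], float],  # (query, doc_a, doc_b)
    query: str,
    doc_ids: List[str],
) -> Dict[str, float]:
    scores = {d: 0.0 for d in doc_ids}
    # Score every ordered pair of distinct candidates
    for a, b in permutations(doc_ids, 2):
        scores[a] += duo_score(query, a, b)
    return scores
```

Note that this requires O(k²) scorer calls over the k candidates, which is why duo scorers are usually applied to a small top-k.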
- XPM Configxpmir.rankers.DuoLearnableScorer[source]
Bases:
LearnableScorer
Base class for models that can score a triplet (query, document 1, document 2)
Misc
- XPM Configxpmir.rankers.full.FullRetriever(*, store, documents)[source]
Bases:
Retriever
Retrieves all the documents of the collection
This can be used to build a small validation set on a subset of the collection - in that case, the scorer can be used through a TwoStageRetriever, with this retriever as the base retriever.
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_text.data.ir.Documents
- XPM Configxpmir.rankers.full.FullRetrieverRescorer(*, store, documents, scorer, batchsize, batcher, device)[source]
Bases:
Retriever
Scores all the documents from a collection
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- documents: datamaestro_text.data.ir.Documents
The set of documents to consider
- scorer: xpmir.neural.DualRepresentationScorer
The scorer (a dual representation scorer)
- batchsize: int = 0
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
- device: xpmir.learning.devices.Device
- XPM Configxpmir.rankers.RetrieverHydrator(*, store, retriever)[source]
Bases:
Retriever
Hydrate retrieved results with document text
- store: datamaestro_text.data.ir.DocumentStore
The store for document texts
- retriever: xpmir.rankers.Retriever
The retriever to hydrate
- XPM Configxpmir.rankers.mergers.SumRetriever(*, store, retrievers, weights)[source]
Bases:
Retriever
Combines the scores of various retrievers
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- retrievers: List[xpmir.rankers.Retriever]
The retrievers to combine
- weights: List[int]
The weights of the retrievers
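The score combination performed by SumRetriever can be illustrated as a weighted sum per document over the result lists (an assumption-level sketch of the documented behaviour, using a `(doc_id, score)` tuple in place of ScoredDocument):

```python
# Sketch of weighted score merging: each document's combined score is the
# weighted sum of its scores across the individual retrievers' results.
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredDoc = Tuple[str, float]

def merge_weighted(
    results: List[List[ScoredDoc]],  # one result list per retriever
    weights: List[int],              # one weight per retriever
) -> List[ScoredDoc]:
    combined: Dict[str, float] = defaultdict(float)
    for weight, scored in zip(weights, results):
        for doc_id, score in scored:
            combined[doc_id] += weight * score
    # Return the merged list sorted by decreasing combined score
    return sorted(combined.items(), key=lambda ds: ds[1], reverse=True)
```

Documents retrieved by only some of the retrievers simply contribute the terms that are present.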
Collection dependent
Anserini
- XPM Configxpmir.index.anserini.Index(*, id, count, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)[source]
Bases:
AdhocIndex
Anserini-backed index
- id: str
The unique dataset ID
- count: int
Number of documents
- path: Path
Path to the index
- storePositions: bool = False
Store term positions
- storeDocvectors: bool = False
Store document term vectors
- storeRaw: bool = False
Store raw document
- storeContents: bool = False
Store processed documents (e.g. without HTML tags)
- stemmer: str = porter
The stemmer to use
- XPM Configxpmir.interfaces.anserini.AnseriniRetriever(*, store, index, model, k)[source]
Bases:
Retriever
An Anserini-based retriever
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- index: xpmir.index.anserini.Index
The Anserini index
- model: xpmir.rankers.standard.Model
The model used to search. Only supports BM25 so far.
- k: int = 1500
Number of results to retrieve
- XPM Taskxpmir.interfaces.anserini.IndexCollection(*, id, count, storePositions, storeDocvectors, storeRaw, storeContents, stemmer, threads, documents, thread)[source]
Bases:
Index
,Task
An [Anserini](https://github.com/castorini/anserini) index
- id: str
Use an empty ID since the identifier is determined by the documents
- count: int
Number of documents
- path: Pathgenerated
- storePositions: bool = False
Store term positions
- storeDocvectors: bool = False
Store document term vectors
- storeRaw: bool = False
Store raw document
- storeContents: bool = False
Store processed documents (e.g. without HTML tags)
- stemmer: str = porter
The stemmer to use
- threads: int = 8
- documents: datamaestro_text.data.ir.Documents
The documents to index
- thread: int = 8
Number of threads when indexing
- XPM Taskxpmir.interfaces.anserini.SearchCollection(*, model, topics, index)[source]
Bases:
Task
- path: Pathgenerated
- topics: datamaestro_text.data.ir.Topics
- index: xpmir.index.anserini.Index
FAISS
- XPM Configxpmir.index.faiss.FaissIndex(*, normalize, documents)[source]
Bases:
Config
FAISS Index
- normalize: bool
Whether vectors should be normalized (L2)
- faiss_index: Pathgenerated
Path to the file containing the index
- documents: datamaestro_text.data.ir.DocumentStore
The set of documents
- XPM Taskxpmir.index.faiss.IndexBackedFaiss(*, normalize, documents, encoder, batchsize, device, batcher, hooks, indexspec, sampler)[source]
Bases:
FaissIndex
,Task
Constructs a FAISS index backed by an index
During execution, InitializationHooks are used (pre/post)
- normalize: bool
Whether vectors should be normalized (L2)
- faiss_index: Pathgenerated
Path to the file containing the index
- documents: datamaestro_text.data.ir.DocumentStore
The set of documents
- encoder: xpmir.text.encoders.TextEncoder
Encoder for document texts
- batchsize: int = 1
The batch size used when computing representations of documents
- device: xpmir.learning.devices.Device = xpmir.learning.devices.Device()
The device used by the encoder
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
The way to prepare batches of documents
- hooks: List[xpmir.context.Hook] = []
An optional list of hooks
- indexspec: str
The index type, given as a factory string. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes for the full list of indices, and https://github.com/facebookresearch/faiss/wiki/The-index-factory for index factory strings
- sampler: xpmir.documents.samplers.DocumentSampler
Optional document sampler when training the index – by default, all the documents from the collection are used
- XPM Configxpmir.index.faiss.FaissRetriever(*, store, encoder, index, topk)[source]
Bases:
Retriever
Retriever based on Faiss
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.text.encoders.TextEncoder
The query encoder
- index: xpmir.index.faiss.FaissIndex
The faiss index
- topk: int
The number of documents to be retrieved
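What a dense FAISS-backed retriever computes can be sketched without FAISS itself: encode the query, optionally L2-normalize (the `normalize` option above), take inner products with the precomputed document vectors, and keep the top-k. FAISS replaces the exhaustive loop below with optimized, possibly approximate, index structures:

```python
# Pure-Python sketch of dense retrieval by inner product over (optionally
# L2-normalized) vectors; FAISS performs the same computation efficiently.
import math
from typing import Dict, List, Tuple

def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dense_topk(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    topk: int,
    normalize: bool = True,
) -> List[Tuple[str, float]]:
    q = l2_normalize(query_vec) if normalize else query_vec
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec) if normalize else vec
        # Inner product between query and document representations
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda ds: ds[1], reverse=True)[:topk]
```

With normalization enabled, the inner product equals the cosine similarity, which is the usual reason for the normalize flag.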
Sparse
- XPM Configxpmir.index.sparse.SparseRetriever(*, store, index, encoder, topk, batcher, batchsize, in_memory)[source]
Bases:
Retriever
- store: datamaestro_text.data.ir.DocumentStore
Give the document store associated with this retriever
- encoder: xpmir.text.encoders.TextEncoder
- topk: int
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
The way to prepare batches of queries (when using retrieve_all)
- batchsize: int
Size of batches (when using retrieve_all)
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- XPM Configxpmir.index.sparse.SparseRetrieverIndex(*, index_path, documents)[source]
Bases:
Config
- index_path: Path
- documents: datamaestro_text.data.ir.DocumentStore
- XPM Taskxpmir.index.sparse.SparseRetrieverIndexBuilder(*, documents, encoder, batcher, batch_size, ordered_index, device, max_postings, in_memory, max_docs)[source]
Bases:
Task
Submit type:
Any
Builds an index from a sparse representation
Assumes that documents and queries have the same dimension, and that the score is computed through an inner product
- documents: datamaestro_text.data.ir.DocumentStore
Set of documents to index
- encoder: xpmir.text.encoders.TextEncoder
The encoder
- batcher: xpmir.learning.batchers.Batcher = xpmir.learning.batchers.Batcher()
Batcher used when computing representations
- batch_size: int
Size of batches
- ordered_index: bool
Whether the index is ordered: if not ordered, uses a DAAT strategy (WAND); otherwise, uses fast top-k strategies
- device: xpmir.learning.devices.Device = xpmir.learning.devices.Device()
- max_postings: int = 16384
Maximum number of postings (per term) before flushing to disk
- index_path: Pathgenerated
- in_memory: bool = False
Whether the index should be fully loaded in memory (otherwise, uses virtual memory)
- version: int = 3constant
Version 3 of the index
- max_docs: int = 0
Maximum number of indexed documents
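The scoring model assumed by the sparse index (same dimension for documents and queries, inner-product score) can be sketched with a toy inverted index: each term maps to a posting list of `(doc_id, weight)` pairs, and a document accumulates `query_weight * doc_weight` for every query term it contains. This is an illustration of the inner-product contract stated above, not xpmir's on-disk format:

```python
# Sketch of sparse inner-product scoring over per-term posting lists.
from collections import defaultdict
from typing import Dict, List, Tuple

# Inverted index: term id -> postings (doc_id, weight)
Index = Dict[int, List[Tuple[str, float]]]

def score_sparse(query: Dict[int, float], index: Index) -> Dict[str, float]:
    scores: Dict[str, float] = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            # Each matching term contributes one inner-product term
            scores[doc_id] += q_weight * d_weight
    return dict(scores)
```

Strategies such as WAND (the DAAT option above) compute the same scores while skipping postings that cannot reach the current top-k threshold.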