Retrieval

Models

class xpmir.rankers.ScoredDocument(docid: Optional[str], score: float, content: Optional[str] = None)

Bases: object

XPM Config xpmir.rankers.Retriever

Bases: experimaestro.core.objects.Config

A retriever is a model that returns the top-scored documents for a given query

collection()

Returns the document collection object

getindex() → datamaestro_text.data.ir.AdhocIndex

Returns the associated index (if any)

retrieve(query: str, content=False) → List[xpmir.rankers.ScoredDocument]

Retrieves documents for a query, returning a list sorted by decreasing score

If content is true, the full text of each document is included

retrieve_all(queries: Dict[str, str]) → Dict[str, List[xpmir.rankers.ScoredDocument]]

Retrieves documents for a set of queries

By default, iterates over the queries using self.retrieve; subclasses can override this method to optimize batch retrieval

Parameters

queries – A dictionary where the key is the ID of the query, and the value is the text
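The default behaviour described above can be sketched as follows. This is a minimal, self-contained illustration rather than the xpmir implementation: ToyRetriever, its term-overlap scoring, and the local ScoredDocument dataclass are hypothetical stand-ins that only mirror the signatures documented here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredDocument:
    # Stand-in mirroring the documented fields of xpmir.rankers.ScoredDocument
    docid: Optional[str]
    score: float
    content: Optional[str] = None


class ToyRetriever:
    """Hypothetical retriever scoring documents by query-term overlap."""

    def __init__(self, documents: Dict[str, str]):
        self.documents = documents

    def retrieve(self, query: str, content: bool = False) -> List[ScoredDocument]:
        terms = set(query.lower().split())
        results = []
        for docid, text in self.documents.items():
            score = float(len(terms & set(text.lower().split())))
            if score > 0:
                results.append(ScoredDocument(docid, score, text if content else None))
        # Sorted by decreasing score, as documented for retrieve()
        results.sort(key=lambda sd: sd.score, reverse=True)
        return results

    def retrieve_all(self, queries: Dict[str, str]) -> Dict[str, List[ScoredDocument]]:
        # Default strategy: one call to retrieve() per query
        return {qid: self.retrieve(text) for qid, text in queries.items()}


retriever = ToyRetriever({
    "d1": "neural ranking models",
    "d2": "classic ranking with bm25",
    "d3": "cooking recipes",
})
results = retriever.retrieve_all({"q1": "neural ranking", "q2": "bm25"})
```

A retriever backed by a search engine could override retrieve_all to send all queries in one request; the per-query loop is only the fallback.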

Standard IR models

Standard IR models are definitions that can be used by a specific retriever implementation, e.g. xpmir.interfaces.anserini.AnseriniRetriever

XPM Config xpmir.rankers.standard.Model

Bases: experimaestro.core.objects.Config

Base class for standard IR models

XPM Config xpmir.rankers.standard.BM25(*, k1, b)

Bases: xpmir.rankers.standard.Model

BM-25 model definition

k1: float = 0.9
b: float = 0.4
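To illustrate the role of k1 and b, here is a minimal sketch of a textbook Okapi BM25 scoring function using the defaults listed above. The function name bm25_score, the whitespace tokenization, and the Lucene-style idf are assumptions of this sketch; the formula actually applied by a backend such as Anserini may differ in details.

```python
import math
from collections import Counter
from typing import List


def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]],
               k1: float = 0.9, b: float = 0.4) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Lucene-style idf (always non-negative); an assumption of this sketch
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # k1 controls term-frequency saturation, b the length normalization
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


corpus = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "stock markets fell sharply today".split(),
]
scores = [bm25_score("cat mat".split(), doc, corpus) for doc in corpus]
```

With b = 0 document length is ignored entirely, while b = 1 fully normalizes term frequency by the ratio of document length to average length.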

Other retrievers

In a re-ranking setting, one can use a two-stage retriever: a fully fledged retriever first selects candidate documents, and a scorer then re-ranks them.

XPM Config xpmir.rankers.TwoStageRetriever(*, retriever, scorer, batchsize, batcher, device)

Bases: xpmir.rankers.Retriever

Uses one retriever to select the top-K documents, which are then re-ranked by a scorer

retriever: xpmir.rankers.Retriever

The base retriever

scorer: xpmir.rankers.Scorer

The scorer used to re-rank the documents

batchsize: int = 0

The batch size for the re-ranker

batcher: xpmir.letor.batchers.Batcher = xpmir.letor.batchers.Batcher()
device: xpmir.letor.devices.Device
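The two-stage idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the xpmir implementation: two_stage_retrieve, first_stage, and score_batch are hypothetical names, documents are represented as (docid, score, text) tuples, and the sketch assumes a strictly positive batch size.

```python
from typing import Callable, List, Tuple

# (docid, score, text) triples stand in for ScoredDocument in this sketch
Scored = Tuple[str, float, str]


def two_stage_retrieve(
    query: str,
    first_stage: Callable[[str], List[Scored]],
    score_batch: Callable[[str, List[str]], List[float]],
    batchsize: int = 2,
) -> List[Scored]:
    candidates = first_stage(query)
    reranked: List[Scored] = []
    # Score the candidates batch by batch to bound the re-ranker's memory use
    for start in range(0, len(candidates), batchsize):
        batch = candidates[start:start + batchsize]
        new_scores = score_batch(query, [text for _, _, text in batch])
        reranked.extend(
            (docid, score, text)
            for (docid, _, text), score in zip(batch, new_scores)
        )
    # Final ranking uses only the re-ranker's scores
    reranked.sort(key=lambda item: item[1], reverse=True)
    return reranked


def first_stage(query: str) -> List[Scored]:
    # Hypothetical first-stage output, already sorted by decreasing score
    return [
        ("d1", 3.0, "ranking methods"),
        ("d2", 2.0, "neural ranking models for search"),
        ("d3", 1.0, "neural ranking"),
    ]


def score_batch(query: str, texts: List[str]) -> List[float]:
    # Toy scorer: fraction of document tokens that appear in the query
    terms = set(query.split())
    return [sum(t in terms for t in text.split()) / len(text.split()) for text in texts]


reranked = two_stage_retrieve("neural ranking", first_stage, score_batch)
```

In this toy run the document ranked last by the first stage ends up first after re-ranking, which is the effect a stronger second-stage scorer is meant to have.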

Anserini

XPM Config xpmir.interfaces.anserini.Index(*, id, count, path, storePositions, storeDocvectors, storeRaw, storeContents, stemmer)

Bases: datamaestro_text.data.ir.AdhocIndex

Anserini-backed index

id: str

The unique dataset ID

count: int

Number of documents

path: Path

Path to the index

storePositions: bool = False

Store term positions

storeDocvectors: bool = False

Store document term vectors

storeRaw: bool = False

Store raw documents

storeContents: bool = False

Store processed documents (e.g. without HTML tags)

stemmer: str = porter

The stemmer to use

XPM Config xpmir.interfaces.anserini.AnseriniRetriever(*, index, model, k)

Bases: xpmir.rankers.Retriever

An Anserini-based retriever

index: xpmir.index.anserini.Index

The Anserini index

model: xpmir.rankers.standard.Model

The model used to search. Only BM25 is supported so far.

k: int = 1500

Number of results to retrieve