Neural models

Cross-Encoder

Models that rely on a joint representation of the query and the document.

XPM Config xpmir.neural.cross.CrossScorer(*, checkpoint, encoder)

Bases: xpmir.neural.TorchLearnableScorer

Query-Document Representation Classifier

Based on a joint query-document representation (e.g. the BERT [CLS] token), also known as a cross-encoder.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

encoder: xpmir.text.encoders.DualTextEncoder

Document (and query) encoder
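A minimal configuration sketch (the dual-text encoder is assumed to be built elsewhere and is only a placeholder here):

```python
# Sketch only: wiring a cross-encoder scorer.
# The encoder must be an xpmir.text.encoders.DualTextEncoder; how it is built
# (e.g. from a HuggingFace transformer) is an assumption left to your setup.
from xpmir.neural.cross import CrossScorer

dual_encoder = ...  # assumed: a DualTextEncoder built elsewhere
scorer = CrossScorer(encoder=dual_encoder)  # checkpoint may be left unset (None)
```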

Dense models

XPM Config xpmir.neural.DualRepresentationScorer(*, checkpoint)

Bases: xpmir.neural.TorchLearnableScorer

Neural scorer based on an (at least partially) independent representation of the document and the query.

This is the base class for all scorers that depend on a map of cosine/inner products between query and document tokens.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

score_pairs(queries, documents, info: Optional[xpmir.letor.context.TrainerContext]) → torch.Tensor

Score the specified pairs of queries/documents.

There are as many queries as documents. The exact type of queries and documents depends on the specific instance of the dual representation scorer.

Parameters
  • queries (Any) – The list of encoded queries

  • documents (Any) – The matching list of encoded documents

  • info (Optional[TrainerContext]) – The training context (if learning)

Returns

A tensor of dimension (N), where N is the number of query-document pairs

Return type

torch.Tensor

score_product(queries, documents, info: Optional[xpmir.letor.context.TrainerContext])

Computes the scores of all possible query-document pairs

Parameters
  • queries (Any) – The encoded queries

  • documents (Any) – The encoded documents

  • info (Optional[TrainerContext]) – The training context (if learning)

Returns

A tensor of dimension (N, P) where N is the number of queries and P the number of documents

Return type

torch.Tensor
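For a dense dot-product scorer, the two methods above differ mainly in their output shape. The following plain PyTorch sketch (an illustration, not the library implementation) shows the intended semantics once queries and documents are encoded as matrices:

```python
import torch

queries = torch.randn(4, 128)    # N encoded queries, shape (N, d)
documents = torch.randn(4, 128)  # N matching encoded documents, shape (N, d)

# score_pairs: one score per aligned (query, document) pair -> shape (N,)
pair_scores = (queries * documents).sum(dim=-1)

# score_product: scores for all query/document combinations -> shape (N, P)
all_documents = torch.randn(10, 128)  # P encoded documents, shape (P, d)
product_scores = queries @ all_documents.T
```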

XPM Config xpmir.neural.dual.Dense(*, checkpoint, encoder, query_encoder)

Bases: xpmir.neural.dual.DualVectorScorer

A scorer based on a pair of (query, document) dense vectors

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

encoder: xpmir.text.encoders.TextEncoder

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoder

The query encoder (optional; if not defined, the document encoder is used)

XPM Config xpmir.neural.dual.DotDense(*, checkpoint, encoder, query_encoder)

Bases: xpmir.neural.dual.Dense

Dual model based on inner product.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

encoder: xpmir.text.encoders.TextEncoder

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoder

The query encoder (optional; if not defined, the document encoder is used)

XPM Config xpmir.neural.dual.CosineDense(*, checkpoint, encoder, query_encoder)

Bases: xpmir.neural.dual.Dense

Dual model based on cosine similarity.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

encoder: xpmir.text.encoders.TextEncoder

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoder

The query encoder (optional; if not defined, the document encoder is used)
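A configuration sketch for the dense bi-encoder scorers above, assuming the document and query encoders (xpmir.text.encoders.TextEncoder instances) are built elsewhere:

```python
# Sketch only: the encoders are placeholders, not part of this API reference.
from xpmir.neural.dual import DotDense, CosineDense

doc_encoder = ...    # assumed: a TextEncoder for documents (and queries if shared)
query_encoder = ...  # assumed: a TextEncoder for queries

shared = DotDense(encoder=doc_encoder)                 # one shared encoder
split = DotDense(encoder=doc_encoder, query_encoder=query_encoder)
cosine = CosineDense(encoder=doc_encoder)              # cosine instead of dot product
```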

Interaction models

xpmir.neural.interaction.InteractionScorer

Interaction-based neural scorer

xpmir.neural.interaction.drmm.Drmm

Deep Relevance Matching Model (DRMM)

xpmir.neural.colbert.Colbert

ColBERT model

XPM Config xpmir.neural.interaction.InteractionScorer(*, checkpoint, vocab, qlen, dlen)

Bases: xpmir.neural.TorchLearnableScorer

Interaction-based neural scorer

This is the base class for all scorers that depend on a map of cosine/inner products between query and document token representations.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

vocab: xpmir.text.Vocab

The embedding model – the vocab also defines how to tokenize text

qlen: int = 20

Maximum query length (the model may shorten it further)

dlen: int = 2000

Maximum document length (the model may shorten it further)

XPM Config xpmir.neural.interaction.drmm.Drmm(*, checkpoint, vocab, qlen, dlen, hist, hidden, index, combine)

Bases: xpmir.neural.interaction.InteractionScorer

Deep Relevance Matching Model (DRMM)

Implementation of the DRMM model from:

Jiafeng Guo, Yixing Fan, Qingyao Ai, and William Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM.

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

vocab: xpmir.text.Vocab

The embedding model – the vocab also defines how to tokenize text

qlen: int = 20

Maximum query length (the model may shorten it further)

dlen: int = 2000

Maximum document length (the model may shorten it further)

hist: xpmir.neural.interaction.drmm.CountHistogram = xpmir.neural.interaction.drmm.LogCountHistogram(nbins=29)

The histogram type

hidden: int = 5

Hidden layer dimension for the feed forward matching network

index: datamaestro_text.data.ir.AdhocIndex

The index (only used when using IDF to combine)

combine: xpmir.neural.interaction.drmm.Combination = xpmir.neural.interaction.drmm.IdfCombination()

How to combine the query term scores
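A hedged configuration sketch for DRMM, keeping the default histogram, hidden size and combination; the embedding model and the index are assumed to be defined elsewhere:

```python
# Sketch only: `glove` (an xpmir.text.Vocab) and `index` (an AdhocIndex used by
# the default IDF combination) are placeholders built elsewhere.
from xpmir.neural.interaction.drmm import Drmm

glove = ...  # assumed: an embedding model / Vocab
index = ...  # assumed: an AdhocIndex providing IDF statistics
drmm = Drmm(vocab=glove, index=index)  # hist, hidden and combine keep their defaults
```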

XPM Config xpmir.neural.colbert.Colbert(*, checkpoint, vocab, qlen, dlen, masktoken, querytoken, doctoken, similarity, linear_dim, compression_size)

Bases: xpmir.neural.interaction.InteractionScorer

ColBERT model

Implementation of the ColBERT model from:

Khattab, Omar, and Matei Zaharia. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020, Xi'an, China

For the standard ColBERT model, use BERT as the vocab(ulary)

checkpoint: Path

A checkpoint path from which the model should be loaded (or None otherwise)

vocab: xpmir.text.Vocab

The embedding model – the vocab also defines how to tokenize text

qlen: int = 20

Maximum query length (the model may shorten it further)

dlen: int = 2000

Maximum document length (the model may shorten it further)

version: int = 2 (constant)

Current version of the code (changes when a bug is found)

masktoken: bool = True

Whether a [MASK] token should be used instead of padding

querytoken: bool = True

Whether a specific query token should be used as a prefix to the question

doctoken: bool = True

Whether a specific document token should be used as a prefix to the document

similarity: xpmir.neural.colbert.Similarity = xpmir.neural.colbert.CosineDistance()

Which similarity to use

linear_dim: int = 128

Size of the last linear layer (before computing inner products)

compression_size: int = 128

Size of the projection layer applied to the last layer (or 0 for no projection)
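ColBERT's late-interaction score is the sum, over query tokens, of the maximum similarity against document tokens (MaxSim). The following standalone PyTorch sketch illustrates the computation for one query/document pair; it is not the library code:

```python
import torch
import torch.nn.functional as F

q = torch.randn(20, 128)   # query token embeddings, shape (qlen, linear_dim)
d = torch.randn(180, 128)  # document token embeddings, shape (doc length, linear_dim)

# Cosine similarity between every query token and every document token
sim = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T  # shape (qlen, doc length)

# MaxSim: best-matching document token per query token, summed over the query
score = sim.max(dim=-1).values.sum()
```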

Sparse Models

XPM Config xpmir.neural.splade.SpladeTextEncoder(*, encoder, aggregation, maxlen)

Bases: xpmir.text.encoders.TextEncoder

Splade model

It is only a text encoder, since scoring is handled by xpmir.neural.dual.DotDense

encoder: xpmir.text.huggingface.TransformerVocab

The encoder from Hugging Face

aggregation: xpmir.neural.splade.Aggregation

How to aggregate the vectors

maxlen: int

Max length for texts
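Since SpladeTextEncoder is a plain text encoder, a SPLADE scorer is obtained by plugging it into DotDense. In this sketch the underlying masked-language-model encoder and the maxlen value are assumptions:

```python
# Sketch only: `mlm_encoder` stands for an xpmir.text.huggingface.TransformerVocab
# (e.g. built from a BERT masked language model); its construction is not shown.
from xpmir.neural.dual import DotDense
from xpmir.neural.splade import SpladeTextEncoder, MaxAggregation

mlm_encoder = ...  # assumed: a TransformerVocab built elsewhere
splade_encoder = SpladeTextEncoder(
    encoder=mlm_encoder, aggregation=MaxAggregation(), maxlen=256
)
splade = DotDense(encoder=splade_encoder)  # inner product over sparse vocabulary vectors
```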

XPM Config xpmir.neural.splade.Aggregation

Bases: experimaestro.core.objects.Config

The aggregation function for Splade

XPM Config xpmir.neural.splade.MaxAggregation

Bases: xpmir.neural.splade.Aggregation

Aggregate using a max

XPM Config xpmir.neural.splade.SumAggregation

Bases: xpmir.neural.splade.Aggregation

Aggregate using a sum
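Both aggregations collapse per-token vocabulary logits into a single sparse vector over the vocabulary; they differ in taking the maximum or the sum over token positions. A plain PyTorch sketch of the two behaviours, assuming the usual log(1 + ReLU(·)) saturation used by SPLADE:

```python
import torch

logits = torch.randn(128, 30522)             # (sequence length, vocabulary size)
activated = torch.log1p(torch.relu(logits))  # SPLADE's log(1 + ReLU) saturation

max_repr = activated.max(dim=0).values  # MaxAggregation: one weight per vocabulary term
sum_repr = activated.sum(dim=0)         # SumAggregation: weights summed over positions
```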

Pretrained models

xpmir.neural.pretrained.spladev2() → xpmir.neural.dual.DotDense

The Splade V2 model (from https://github.com/naver/splade)

SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, Thibault Formal, Benjamin Piwowarski, Carlos Lassance, and Stéphane Clinchant.

https://arxiv.org/abs/2109.10086

xpmir.neural.pretrained.tas_balanced()

Returns the TAS-Balanced model (from Hugging Face)

Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury.

Returns

A DotDense ranker based on tas-balanced

Return type

DotDense
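Both helpers return ready-to-use DotDense scorers; a minimal usage sketch:

```python
from xpmir.neural.pretrained import spladev2, tas_balanced

splade_model = spladev2()   # SPLADE v2 (a DotDense over Splade text encoders)
tas_model = tas_balanced()  # TAS-Balanced dense bi-encoder (a DotDense)
```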