Neural models

Cross-Encoder

Models that rely on a joint representation of the query and the document.

XPM Config xpmir.neural.cross.CrossScorer(*, encoder)[source]

Bases: LearnableScorer, DistributableModel

Submit type: xpmir.neural.cross.CrossScorer

Query-Document Representation Classifier

Based on a joint query-document representation (e.g. the BERT [CLS] token). AKA Cross-Encoder

encoder: xpmir.text.encoders.TextEncoderBase[Tuple[str, str], torch.Tensor]

an encoder for the concatenated query-document tokens that does not contain the final linear layer
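
A minimal construction sketch: `pair_encoder` below is a placeholder for any concrete TextEncoderBase[Tuple[str, str], torch.Tensor] implementation and is not part of this API.

    from xpmir.neural.cross import CrossScorer

    # pair_encoder (placeholder): encodes the concatenated (query, document)
    # pair and does not include the final linear layer
    scorer = CrossScorer(encoder=pair_encoder)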

XPM Config xpmir.neural.jointclassifier.JointClassifier(*, encoder)[source]

Bases: CrossScorer

Submit type: xpmir.neural.jointclassifier.JointClassifier

encoder: xpmir.text.encoders.TextEncoderBase[Tuple[str, str], torch.Tensor]

an encoder for the concatenated query-document tokens that does not contain the final linear layer

XPM Config xpmir.neural.cross.DuoCrossScorer(*, encoder)[source]

Bases: DuoLearnableScorer, DistributableModel

Submit type: xpmir.neural.cross.DuoCrossScorer

Preference based classifier

This scorer can be used to train a DuoBERT-type model.

encoder: xpmir.text.encoders.TripletTextEncoder

The encoder to use for the DuoBERT model

Dual models

Dual models compute separate representations for documents and queries, which allows some speedup when computing scores for several documents and/or queries.

XPM Config xpmir.neural.DualRepresentationScorer[source]

Bases: LearnableScorer, Generic[QueriesRep, DocsRep]

Submit type: xpmir.neural.DualRepresentationScorer

Neural scorer based on (at least partially) independent representations of the document and the query.

This is the base class for all scorers that depend on a map of cosine/inner products between query and document tokens.

abstract score_pairs(queries: QueriesRep, documents: DocsRep, info: TrainerContext | None = None) torch.Tensor[source]

Score the specified pairs of queries/documents.

There are as many queries as documents. The exact type of queries and documents depends on the specific instance of the dual representation scorer.

Parameters:
  • queries (QueriesRep) – The list of encoded queries

  • documents (DocsRep) – The matching list of encoded documents

  • info (Optional[TrainerContext]) – The training context (if learning)

Returns:

A tensor of dimension (N) where N is the number of documents/queries

Return type:

torch.Tensor

abstract score_product(queries: QueriesRep, documents: DocsRep, info: TrainerContext | None = None) torch.Tensor[source]

Computes the score of all possible pairs of query and document

Parameters:
  • queries (Any) – The encoded queries

  • documents (Any) – The encoded documents

  • info (Optional[TrainerContext]) – The training context (if learning)

Returns:

A tensor of dimension (N, P) where N is the number of queries and P the number of documents

Return type:

torch.Tensor
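
A sketch of the intended shapes for the simple case where queries and documents are plain 2D tensors and the score is an inner product; concrete subclasses define their own representation types.

    import torch

    queries = torch.randn(4, 128)    # 4 encoded queries
    documents = torch.randn(4, 128)  # 4 matching encoded documents

    # score_pairs: one score per aligned (query, document) pair -> shape (4,)
    pair_scores = (queries * documents).sum(-1)

    # score_product: scores for all query/document combinations -> shape (4, 4)
    all_scores = queries @ documents.T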

XPM Config xpmir.neural.dual.DualVectorScorer(*, encoder, query_encoder)[source]

Bases: DualRepresentationScorer[QueriesRep, DocsRep]

Submit type: xpmir.neural.dual.DualVectorScorer

A scorer based on dual vectorial representations

encoder: xpmir.text.encoders.TextEncoderBase

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoderBase

The query encoder (optional; if not defined, the document encoder is used)

Hooks

XPM Config xpmir.neural.dual.DualVectorListener[source]

Bases: TrainingHook

Submit type: xpmir.neural.dual.DualVectorListener

Listener called with the (vectorial) representation of queries and documents

The hook is called just after the computation of documents and queries representations.

This can be used for logging purposes but, more importantly, to add regularization losses such as FlopsRegularizer.

__call__(context: TrainerContext, queries: torch.Tensor, documents: torch.Tensor)[source]

Hook handler

Parameters:
  • context (TrainerContext) – The training context

  • queries (torch.Tensor) – The query vectors

  • documents (torch.Tensor) – The document vectors

Raises:

NotImplementedError – if the hook handler is not implemented by the subclass
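
A hedged sketch of a custom listener that inspects the representations; how metrics are actually reported through the TrainerContext is an assumption and not shown here.

    from xpmir.neural.dual import DualVectorListener

    class SparsityLogger(DualVectorListener):
        """Illustrative hook: logs the average number of non-zero entries."""

        def __call__(self, context, queries, documents):
            q_l0 = (queries != 0).float().sum(-1).mean().item()
            d_l0 = (documents != 0).float().sum(-1).mean().item()
            print(f"L0(queries)={q_l0:.1f} L0(documents)={d_l0:.1f}")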

XPM Config xpmir.neural.dual.FlopsRegularizer(*, lambda_q, lambda_d)[source]

Bases: DualVectorListener

Submit type: xpmir.neural.dual.FlopsRegularizer

The FLOPS regularizer computes

\[FLOPS(q,d) = \lambda_q FLOPS(q) + \lambda_d FLOPS(d)\]

where

\[FLOPS(x) = \left( \frac{1}{d} \sum_{i=1}^d |x_i| \right)^2\]

lambda_q: float

Lambda for queries

lambda_d: float

Lambda for documents
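
A worked sketch of the loss defined by the formulas above, applied to a batch of representations; the lambda values and the averaging over the batch are illustrative assumptions.

    import torch

    def flops(x: torch.Tensor) -> torch.Tensor:
        # FLOPS(x) = ((1/d) * sum_i |x_i|)^2, following the formula above
        return x.abs().mean(-1) ** 2

    lambda_q, lambda_d = 3e-4, 1e-4   # example values
    q = torch.rand(8, 30522)          # batch of query representations
    d = torch.rand(8, 30522)          # batch of document representations
    loss = lambda_q * flops(q).mean() + lambda_d * flops(d).mean()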

XPM Config xpmir.neural.dual.ScheduledFlopsRegularizer(*, lambda_q, lambda_d, min_lambda_q, min_lambda_d, lambda_warmup_steps)[source]

Bases: FlopsRegularizer

Submit type: xpmir.neural.dual.ScheduledFlopsRegularizer

The FLOPS regularizer where lambda_q and lambda_d vary with the training step. The lambda values increase quadratically until `lambda_warmup_steps` is reached, and then remain constant.

lambda_q: float

Lambda for queries

lambda_d: float

Lambda for documents

min_lambda_q: float = 0

Minimum value of lambda_q before it increases

min_lambda_d: float = 0

Minimum value of lambda_d before it increases

lambda_warmup_steps: int = 0

The number of warmup steps for the lambda values
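
The exact interpolation is not specified above; a plausible sketch of a quadratic warmup from the minimum lambda to the target lambda over lambda_warmup_steps (an assumption about the implementation):

    def scheduled_lambda(step: int, lambda_max: float,
                         lambda_min: float = 0.0, warmup_steps: int = 0) -> float:
        # Quadratic increase before `warmup_steps`, constant afterwards
        if warmup_steps <= 0 or step >= warmup_steps:
            return lambda_max
        ratio = step / warmup_steps
        return lambda_min + (lambda_max - lambda_min) * ratio ** 2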

Dense models

XPM Config xpmir.neural.dual.Dense(*, encoder, query_encoder)[source]

Bases: DualVectorScorer[QueriesRep, DocsRep]

Submit type: xpmir.neural.dual.Dense

A scorer based on a pair of (query, document) dense vectors

encoder: xpmir.text.encoders.TextEncoderBase

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoderBase

The query encoder (optional; if not defined, the document encoder is used)

classmethod from_sentence_transformers(hf_id: str, **kwargs)[source]

Creates a dense model from a Sentence Transformers model

The list of available models can be found on HuggingFace: https://huggingface.co/models?library=sentence-transformers

Parameters:

hf_id – The HuggingFace ID
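
For instance (the checkpoint identifier below is only an example of a sentence-transformers model):

    from xpmir.neural.dual import Dense

    scorer = Dense.from_sentence_transformers(
        "sentence-transformers/all-MiniLM-L6-v2"
    )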

XPM Config xpmir.neural.dual.DotDense(*, encoder, query_encoder)[source]

Bases: Dense, DistributableModel

Submit type: xpmir.neural.dual.DotDense

Dual model based on inner product.

encoder: xpmir.text.encoders.TextEncoderBase

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoderBase

The query encoder (optional; if not defined, the document encoder is used)
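
A minimal sketch of a dual encoder scored by inner product; `doc_encoder` and `q_encoder` are placeholders for concrete TextEncoderBase instances.

    from xpmir.neural.dual import DotDense

    # query_encoder may be omitted, in which case the document encoder is reused
    scorer = DotDense(encoder=doc_encoder, query_encoder=q_encoder)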

XPM Config xpmir.neural.dual.CosineDense(*, encoder, query_encoder)[source]

Bases: Dense

Submit type: xpmir.neural.dual.CosineDense

Dual model based on cosine similarity.

encoder: xpmir.text.encoders.TextEncoderBase

The document (and potentially query) encoder

query_encoder: xpmir.text.encoders.TextEncoderBase

The query encoder (optional; if not defined, the document encoder is used)

Interaction models

xpmir.neural.interaction.InteractionScorer

Interaction-based neural scorer

xpmir.neural.interaction.drmm.Drmm

Deep Relevance Matching Model (DRMM)

xpmir.neural.interaction.colbert.Colbert

ColBERT model

XPM Config xpmir.neural.interaction.InteractionScorer(*, encoder, query_encoder, similarity, qlen, dlen)[source]

Bases: DualVectorScorer[SimilarityInput, SimilarityInput]

Submit type: xpmir.neural.interaction.InteractionScorer

Interaction-based neural scorer

This is the base class for all scorers that depend on a map of cosine/inner products between query and document token representations.

encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model – the vocab also defines how to tokenize text

query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model for queries (if None, uses encoder)

similarity: xpmir.neural.interaction.common.Similarity

Which similarity function to use - ColBERT uses a cosine similarity by default

qlen: int = 20

Maximum query length (this can be further shortened by the model)

dlen: int = 2000

Maximum document length (this can be further shortened by the model)

XPM Config xpmir.neural.interaction.drmm.Drmm(*, encoder, query_encoder, similarity, qlen, dlen, hist, hidden, index, combine)[source]

Bases: InteractionScorer

Submit type: xpmir.neural.interaction.drmm.Drmm

Deep Relevance Matching Model (DRMM)

Implementation of the DRMM model from:

Jiafeng Guo, Yixing Fan, Qingyao Ai, and William Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM.

encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model – the vocab also defines how to tokenize text

query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model for queries (if None, uses encoder)

similarity: xpmir.neural.interaction.common.Similarity

Which similarity function to use - ColBERT uses a cosine similarity by default

qlen: int = 20

Maximum query length (this can be further shortened by the model)

dlen: int = 2000

Maximum document length (this can be further shortened by the model)

hist: xpmir.neural.interaction.drmm.CountHistogram = xpmir.neural.interaction.drmm.LogCountHistogram.XPMValue(nbins=29)

The histogram type

hidden: int = 5

Hidden layer dimension for the feed forward matching network

index: datamaestro_text.data.ir.AdhocIndex

The index (only used when combining query term scores with IDF)

combine: xpmir.neural.interaction.drmm.Combination = xpmir.neural.interaction.drmm.IdfCombination.XPMValue()

How to combine the query term scores
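
A hedged construction sketch relying on the defaults documented above; `token_encoder` and `adhoc_index` are placeholders for a concrete TokenizedTextEncoderBase and an AdhocIndex.

    from xpmir.neural.interaction.drmm import Drmm

    # hist defaults to LogCountHistogram(nbins=29) and combine to IdfCombination;
    # the index is only needed because IDF combination requires it
    drmm = Drmm(encoder=token_encoder, index=adhoc_index)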

XPM Config xpmir.neural.interaction.colbert.Colbert(*, encoder, query_encoder, similarity, qlen, dlen, linear_dim, compression_size)[source]

Bases: InteractionScorer

Submit type: xpmir.neural.interaction.colbert.Colbert

ColBERT model

Implementation of the Colbert model from:

Khattab, Omar, and Matei Zaharia. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020, Xi’An, China

For the standard Colbert model, use the colbert function

encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model – the vocab also defines how to tokenize text

query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]

The embedding model for queries (if None, uses encoder)

similarity: xpmir.neural.interaction.common.Similarity

Which similarity function to use - ColBERT uses a cosine similarity by default

qlen: int = 20

Maximum query length (this can be further shortened by the model)

dlen: int = 2000

Maximum document length (this can be further shortened by the model)

version: int = 2 (constant)

Current version of the code (changes if a bug is found)

linear_dim: int = 128

Size of the last linear layer (before computing inner products)

compression_size: int = 128

Size of the projection layer applied to the last layer (0 means no projection)
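
A construction sketch; `token_encoder` is a placeholder for a concrete TokenizedTextEncoderBase, and the length values below are illustrative rather than recommended settings.

    from xpmir.neural.interaction.colbert import Colbert
    from xpmir.neural.interaction.common import CosineSimilarity

    colbert = Colbert(
        encoder=token_encoder,
        similarity=CosineSimilarity(),
        qlen=32,
        dlen=180,
        linear_dim=128,
    )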

DRMM

XPM Config xpmir.neural.interaction.drmm.Combination[source]

Bases: Config, TorchModule

Submit type: xpmir.neural.interaction.drmm.Combination

XPM Config xpmir.neural.interaction.drmm.CountHistogram(*, nbins)[source]

Bases: Config, TorchModule

Submit type: xpmir.neural.interaction.drmm.CountHistogram

Base histogram class

nbins: int = 29

number of bins in matching histogram

XPM Config xpmir.neural.interaction.drmm.IdfCombination[source]

Bases: Combination

Submit type: xpmir.neural.interaction.drmm.IdfCombination

XPM Config xpmir.neural.interaction.drmm.LogCountHistogram(*, nbins)[source]

Bases: CountHistogram

Submit type: xpmir.neural.interaction.drmm.LogCountHistogram

nbins: int = 29

number of bins in matching histogram

XPM Config xpmir.neural.interaction.drmm.NormalizedHistogram(*, nbins)[source]

Bases: CountHistogram

Submit type: xpmir.neural.interaction.drmm.NormalizedHistogram

nbins: int = 29

number of bins in matching histogram

XPM Config xpmir.neural.interaction.drmm.SumCombination[source]

Bases: Combination

Submit type: xpmir.neural.interaction.drmm.SumCombination

Similarity

XPM Config xpmir.neural.interaction.common.Similarity[source]

Bases: Config, ABC

Submit type: xpmir.neural.interaction.common.Similarity

Base class for similarity between two text representations (3D tensors of shape batch x length x dim)

XPM Config xpmir.neural.interaction.common.DotProductSimilarity[source]

Bases: Similarity

Submit type: xpmir.neural.interaction.common.DotProductSimilarity

XPM Config xpmir.neural.interaction.common.CosineSimilarity[source]

Bases: DotProductSimilarity

Submit type: xpmir.neural.interaction.common.CosineSimilarity

Cosine similarity between two text representations (3D tensors of shape batch x length x dim)

class xpmir.neural.interaction.common.SimilarityInput(value: torch.Tensor, mask: torch.BoolTensor)[source]

Bases: Sequence[SimilarityInput]

class xpmir.neural.interaction.common.SimilarityOutput(similarity: torch.Tensor)[source]

Bases: ABC

Output for token similarities

Sparse Models

XPM Config xpmir.neural.splade.SpladeTextEncoder(*, encoder, aggregation, maxlen)[source]

Bases: TextEncoder, DistributableModel

Submit type: xpmir.neural.splade.SpladeTextEncoder

Splade model

It is only a text encoder, since xpmir.neural.dual.DotDense is used as the scorer class (see the sketch after this configuration)

encoder: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput

The encoder from Hugging Face

aggregation: xpmir.neural.splade.Aggregation

How to aggregate the vectors

maxlen: int

Max length for texts
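
Since this is only a text encoder, a complete SPLADE scorer is obtained by plugging it into xpmir.neural.dual.DotDense, as noted above; `mlm_encoder` is a placeholder for a TransformerTokensEncoderWithMLMOutput instance, and the maxlen value is illustrative.

    from xpmir.neural.dual import DotDense
    from xpmir.neural.splade import MaxAggregation, SpladeTextEncoder

    splade_encoder = SpladeTextEncoder(
        encoder=mlm_encoder,
        aggregation=MaxAggregation(),
        maxlen=256,
    )
    scorer = DotDense(encoder=splade_encoder)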

XPM Config xpmir.neural.splade.SpladeTextEncoderV2(*, tokenizer, encoder, aggregation, maxlen)[source]

Bases: TextEncoderBase[InputType, TextsRepresentationOutput], DistributableModel, Generic[InputType]

Submit type: xpmir.neural.splade.SpladeTextEncoderV2

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizerBase[InputType]

The tokenizer from Hugging Face

encoder: xpmir.text.huggingface.base.HFMaskedLanguageModel

The encoder from Hugging Face

aggregation: xpmir.neural.splade.Aggregation

How to aggregate the vectors

maxlen: int

Max length for texts

XPM Config xpmir.neural.splade.Aggregation[source]

Bases: Config

Submit type: xpmir.neural.splade.Aggregation

The aggregation function for Splade

XPM Config xpmir.neural.splade.MaxAggregation[source]

Bases: Aggregation

Submit type: xpmir.neural.splade.MaxAggregation

Aggregate using a max

XPM Config xpmir.neural.splade.SumAggregation[source]

Bases: Aggregation

Submit type: xpmir.neural.splade.SumAggregation

Aggregate using a sum

Generative Models

XPM Config xpmir.neural.generative.ConditionalGenerator[source]

Bases: Module

Submit type: xpmir.neural.generative.ConditionalGenerator

Models that generate an identifier given a document or a query

XPM Config xpmir.neural.generative.cross.GenerativeCrossScorer(*, pattern, generator, relevant_token_id)[source]

Bases: LearnableScorer

Submit type: xpmir.neural.generative.cross.GenerativeCrossScorer

A cross-encoder based on a generative model

version: int = 2 (constant)

Generative cross scorer version changelog: 1. corrects the output type (probability)

pattern: str = Query: {query} Document: {document} Relevant:

generator: xpmir.neural.generative.ConditionalGenerator

relevant_token_id: int
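
A sketch of how the default pattern above is filled before being passed to the generator; the actual scoring call is internal to the class, and the choice of relevant token (e.g. the token for "true" in a monoT5-style setup) is an assumption.

    pattern = "Query: {query} Document: {document} Relevant:"
    prompt = pattern.format(
        query="what causes tides",
        document="Tides are caused by the gravitational pull of the moon ...",
    )
    # The relevance score is the probability assigned to `relevant_token_id`
    # when generating the next token after this prompt.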

HuggingFace Generative Models

XPM Config xpmir.neural.generative.hf.LoadFromT5(*, t5_model)[source]

Bases: LightweightTask

Submit type: xpmir.neural.generative.hf.LoadFromT5

Load parameters from a T5 model

t5_model: xpmir.neural.generative.hf.T5ConditionalGenerator

the target generator whose parameters are initialized from the T5 model

XPM Config xpmir.neural.generative.hf.T5IdentifierGenerator(*, hf_id, decoder_outdim)[source]

Bases: T5ConditionalGenerator

Submit type: xpmir.neural.generative.hf.T5IdentifierGenerator

Generates token identifiers using a T5-based model

hf_id: str

The HuggingFace identifier (to configure the model)

decoder_outdim: int = 10

The decoder output dimension for the T5 model; used to rebuild the lm_head and the decoder embeddings. This number does not include the pad and EOS tokens.

XPM Config xpmir.neural.generative.hf.T5ConditionalGenerator(*, hf_id)[source]

Bases: ConditionalGenerator, DistributableModel

Submit type: xpmir.neural.generative.hf.T5ConditionalGenerator

hf_id: str

The HuggingFace identifier (to configure the model)
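
A sketch wiring a T5-based generator together with the LoadFromT5 task that initializes its parameters; "t5-base" is just an example identifier.

    from xpmir.neural.generative.hf import LoadFromT5, T5ConditionalGenerator

    generator = T5ConditionalGenerator(hf_id="t5-base")
    init_task = LoadFromT5(t5_model=generator)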

XPM Config xpmir.neural.generative.hf.T5CustomOutputGenerator(*, hf_id, tokens)[source]

Bases: T5ConditionalGenerator

Submit type: xpmir.neural.generative.hf.T5CustomOutputGenerator

Generates token identifiers using a T5-based model

hf_id: str

The HuggingFace identifier (to configure the model)

tokens: List[str]

From Huggingface

XPM Config xpmir.neural.huggingface.HFCrossScorer(*, hf_id, max_length)[source]

Bases: LearnableScorer, DistributableModel

Submit type: xpmir.neural.huggingface.HFCrossScorer

Load a cross-scorer model from HuggingFace

hf_id: str

the identifier of the HuggingFace model

max_length: int

the maximum input length for the transformer model
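
For instance, loading an existing cross-encoder checkpoint from the HuggingFace hub (the identifier below is an example, not a requirement):

    from xpmir.neural.huggingface import HFCrossScorer

    scorer = HFCrossScorer(
        hf_id="cross-encoder/ms-marco-MiniLM-L-6-v2",
        max_length=512,
    )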