Neural models
Cross-Encoder
Models that rely on a joint representation of the query and the document.
- XPM Configxpmir.neural.cross.CrossScorer(*, encoder)[source]
Bases: LearnableScorer, DistributableModel
Submit type: xpmir.neural.cross.CrossScorer
Query-Document Representation Classifier
Based on a joint query-document representation (e.g. the BERT [CLS] token). Also known as a cross-encoder.
- encoder: xpmir.text.encoders.TextEncoderBase[Tuple[str, str], torch.Tensor]
An encoder for the concatenated query-document tokens; it does not contain the final linear layer
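As an illustration of the cross-encoder idea (using plain HuggingFace rather than the xpmir encoder classes), a cross-encoder checkpoint scores a query-document pair from their joint representation; the checkpoint name below is only an example:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Plain HuggingFace sketch of cross-encoder scoring: the query and document
# are encoded jointly and one relevance score is read from the classification head.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

inputs = tokenizer("what is splade?", "SPLADE is a sparse neural retriever.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1)  # one relevance score per pair
```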
- XPM Configxpmir.neural.jointclassifier.JointClassifier(*, encoder)[source]
Bases:
CrossScorer
Submit type:
xpmir.neural.jointclassifier.JointClassifier
- encoder: xpmir.text.encoders.TextEncoderBase[Tuple[str, str], torch.Tensor]
An encoder for the concatenated query-document tokens; it does not contain the final linear layer
- XPM Configxpmir.neural.cross.DuoCrossScorer(*, encoder)[source]
Bases: DuoLearnableScorer, DistributableModel
Submit type: xpmir.neural.cross.DuoCrossScorer
Preference based classifier
This scorer can be used to train a DuoBERT-type model.
- encoder: xpmir.text.encoders.TripletTextEncoder
The encoder to use for the DuoBERT model
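A hypothetical post-processing sketch (not part of DuoCrossScorer itself): pairwise preference scores `p[i, j] ≈ P(d_i more relevant than d_j | q)` are aggregated into per-document scores, following one of the aggregation strategies discussed in the DuoBERT paper.

```python
import torch

# Preferences among 5 candidate documents for one query
p = torch.rand(5, 5)
# Sum over opponents, dropping the self-comparison term
scores = p.sum(dim=1) - p.diagonal()
ranking = scores.argsort(descending=True)
```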
Dual models
Dual models compute a separate representation for documents and queries, which allows some speedup when computing scores of several documents and/or queries.
- XPM Configxpmir.neural.DualRepresentationScorer[source]
Bases: LearnableScorer, Generic[QueriesRep, DocsRep]
Submit type: xpmir.neural.DualRepresentationScorer
Neural scorer based on (at least partially) independent representations of the document and the query.
This is the base class for all scorers that compute separate query and document representations.
- abstract score_pairs(queries: QueriesRep, documents: DocsRep, info: TrainerContext | None = None) torch.Tensor [source]
Score the specified pairs of queries/documents.
There are as many queries as documents. The exact type of queries and documents depends on the specific instance of the dual representation scorer.
- Parameters:
queries (QueriesRep) – The list of encoded queries
documents (DocsRep) – The matching list of encoded documents
info (Optional[TrainerContext]) – The training context (if learning)
- Returns:
A tensor of dimension (N) where N is the number of documents/queries
- Return type:
torch.Tensor
- abstract score_product(queries: QueriesRep, documents: DocsRep, info: TrainerContext | None = None) torch.Tensor [source]
Computes the score of all possible pairs of query and document
- Parameters:
queries (Any) – The encoded queries
documents (Any) – The encoded documents
info (Optional[TrainerContext]) – The training context (if learning)
- Returns:
A tensor of dimension (N, P) where N is the number of queries and P the number of documents
- Return type:
torch.Tensor
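To illustrate the shape contract of the two methods, here is a toy sketch with single-vector representations and inner products; the actual QueriesRep/DocsRep types depend on the concrete scorer.

```python
import torch

# Toy shapes only: 4 encoded queries paired with 4 encoded documents
queries = torch.randn(4, 128)
documents = torch.randn(4, 128)

# score_pairs-like: one score per (query, document) pair -> shape (4,)
pair_scores = (queries * documents).sum(dim=-1)

# score_product-like: all query-document combinations -> shape (4, 4)
all_scores = queries @ documents.T
```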
- XPM Configxpmir.neural.dual.DualVectorScorer(*, encoder, query_encoder)[source]
Bases: DualRepresentationScorer[QueriesRep, DocsRep]
Submit type: xpmir.neural.dual.DualVectorScorer
A scorer based on dual vectorial representations
- encoder: xpmir.text.encoders.TextEncoderBase
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoderBase
The query encoder (optional; if not defined, the document encoder is used)
Hooks
- XPM Configxpmir.neural.dual.DualVectorListener[source]
Bases:
TrainingHook
Submit type:
xpmir.neural.dual.DualVectorListener
Listener called with the (vectorial) representation of queries and documents
The hook is called just after the computation of document and query representations.
This can be used for logging purposes but, more importantly, to add regularization losses such as the FlopsRegularizer.
- XPM Configxpmir.neural.dual.FlopsRegularizer(*, lambda_q, lambda_d)[source]
Bases:
DualVectorListener
Submit type:
xpmir.neural.dual.FlopsRegularizer
The FLOPS regularizer computes
\[FLOPS(q,d) = \lambda_q FLOPS(q) + \lambda_d FLOPS(d)\]
where
\[FLOPS(x) = \left( \frac{1}{d} \sum_{i=1}^d |x_i| \right)^2\]
- lambda_q: float
Lambda for queries
- lambda_d: float
Lambda for documents
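A minimal PyTorch sketch of the formula above, taken literally (mean of absolute activations over the representation dimension, squared); note that the FLOPS regularizer of the SPLADE paper averages over the batch for each dimension instead, so the actual xpmir implementation may differ.

```python
import torch

def flops(x: torch.Tensor) -> torch.Tensor:
    # FLOPS(x) = ((1/d) * sum_i |x_i|)^2 along the representation dimension
    return x.abs().mean(dim=-1) ** 2

def flops_loss(q: torch.Tensor, d: torch.Tensor,
               lambda_q: float, lambda_d: float) -> torch.Tensor:
    # FLOPS(q, d) = lambda_q * FLOPS(q) + lambda_d * FLOPS(d), averaged over the batch
    return lambda_q * flops(q).mean() + lambda_d * flops(d).mean()

# Example lambda values (arbitrary, for illustration only)
loss = flops_loss(torch.rand(8, 30522), torch.rand(8, 30522),
                  lambda_q=3e-4, lambda_d=1e-4)
```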
- XPM Configxpmir.neural.dual.ScheduledFlopsRegularizer(*, lambda_q, lambda_d, min_lambda_q, min_lambda_d, lambda_warmup_steps)[source]
Bases:
FlopsRegularizer
Submit type:
xpmir.neural.dual.ScheduledFlopsRegularizer
The FLOPS regularizer where lambda_q and lambda_d vary with the step count: the lambda values increase quadratically up to `lambda_warmup_steps`, then remain constant.
- lambda_q: float
Lambda for queries
- lambda_d: float
Lambda for documents
- min_lambda_q: float = 0
Minimum value of lambda_q before it increases
- min_lambda_d: float = 0
Minimum value of lambda_d before it increases
- lambda_warmup_steps: int = 0
The number of warmup steps for the lambda schedule
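A sketch of what such a schedule could look like; the exact interpolation is an assumption, since the documentation only states that the values grow quadratically until `lambda_warmup_steps` and stay constant afterwards.

```python
def scheduled_lambda(step: int, lambda_final: float,
                     min_lambda: float = 0.0, warmup_steps: int = 0) -> float:
    # Assumed quadratic ramp from min_lambda to lambda_final, then constant
    if warmup_steps <= 0 or step >= warmup_steps:
        return lambda_final
    ratio = step / warmup_steps
    return min_lambda + (lambda_final - min_lambda) * ratio ** 2
```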
Dense models
- XPM Configxpmir.neural.dual.Dense(*, encoder, query_encoder)[source]
Bases: DualVectorScorer[QueriesRep, DocsRep]
Submit type: xpmir.neural.dual.Dense
A scorer based on a pair of (query, document) dense vectors
- encoder: xpmir.text.encoders.TextEncoderBase
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoderBase
The query encoder (optional; if not defined, the document encoder is used)
- classmethod from_sentence_transformers(hf_id: str, **kwargs)[source]
Creates a dense model from a Sentence transformer
The list can be found on HuggingFace https://huggingface.co/models?library=sentence-transformers
- Parameters:
hf_id – The HuggingFace ID
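A usage sketch (the checkpoint name is only an example of a sentence-transformers model on HuggingFace):

```python
from xpmir.neural.dual import DotDense

# DotDense inherits from Dense, so the classmethod is available there too
scorer = DotDense.from_sentence_transformers(
    "sentence-transformers/all-MiniLM-L6-v2"
)
```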
- XPM Configxpmir.neural.dual.DotDense(*, encoder, query_encoder)[source]
Bases: Dense, DistributableModel
Submit type: xpmir.neural.dual.DotDense
Dual model based on inner product.
- encoder: xpmir.text.encoders.TextEncoderBase
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoderBase
The query encoder (optional; if not defined, the document encoder is used)
- XPM Configxpmir.neural.dual.CosineDense(*, encoder, query_encoder)[source]
Bases:
Dense
Submit type:
xpmir.neural.dual.CosineDense
Dual model based on cosine similarity.
- encoder: xpmir.text.encoders.TextEncoderBase
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoderBase
The query encoder (optional; if not defined, the document encoder is used)
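The difference between the two dense scorers boils down to whether the vectors are L2-normalized before the inner product; a conceptual PyTorch sketch:

```python
import torch
import torch.nn.functional as F

q = torch.randn(3, 128)   # encoded queries
d = torch.randn(3, 128)   # encoded documents (paired with the queries)

dot_scores = (q * d).sum(dim=-1)                                             # DotDense-style
cos_scores = (F.normalize(q, dim=-1) * F.normalize(d, dim=-1)).sum(dim=-1)   # CosineDense-style
```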
Interaction models
- InteractionScorer: Interaction-based neural scorer
- Drmm: Deep Relevance Matching Model (DRMM)
- Colbert: ColBERT model
- XPM Configxpmir.neural.interaction.InteractionScorer(*, encoder, query_encoder, similarity, qlen, dlen)[source]
Bases: DualVectorScorer[SimilarityInput, SimilarityInput]
Submit type: xpmir.neural.interaction.InteractionScorer
Interaction-based neural scorer
This is the base class for all scorers that depend on a map of cosine/inner products between query and document token representations.
- encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model – the vocab also defines how to tokenize text
- query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model for queries (if None, uses encoder)
- similarity: xpmir.neural.interaction.common.Similarity
Which similarity function to use - ColBERT uses a cosine similarity by default
- qlen: int = 20
Maximum query length (this can even be shortened by the model)
- dlen: int = 2000
Maximum document length (this can even be shortened by the model)
- XPM Configxpmir.neural.interaction.drmm.Drmm(*, encoder, query_encoder, similarity, qlen, dlen, hist, hidden, index, combine)[source]
Bases:
InteractionScorer
Submit type:
xpmir.neural.interaction.drmm.Drmm
Deep Relevance Matching Model (DRMM)
Implementation of the DRMM model from:
Jiafeng Guo, Yixing Fan, Qingyao Ai, and William Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM.
- encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model – the vocab also defines how to tokenize text
- query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model for queries (if None, uses encoder)
- similarity: xpmir.neural.interaction.common.Similarity
Which similarity function to use - ColBERT uses a cosine similarity by default
- qlen: int = 20
Maximum query length (this can even be shortened by the model)
- dlen: int = 2000
Maximum document length (this can even be shortened by the model)
- hist: xpmir.neural.interaction.drmm.CountHistogram = xpmir.neural.interaction.drmm.LogCountHistogram.XPMValue(nbins=29)
The histogram type
- hidden: int
Hidden layer dimension for the feed-forward matching network
- index: datamaestro_text.data.ir.AdhocIndex
The index (only used when using IDF to combine)
- combine: xpmir.neural.interaction.drmm.Combination = xpmir.neural.interaction.drmm.IdfCombination.XPMValue()
How to combine the query term scores
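To illustrate the matching-histogram idea (the exact binning used by xpmir's CountHistogram classes may differ), here is a sketch of a log-count histogram for one query term:

```python
import torch

nbins = 29
# Cosine similarities between one query term and all document terms
sims = torch.rand(300) * 2 - 1
# Bucket similarities from [-1, 1] into nbins bins, then log-scale the counts
bins = ((sims + 1) / 2 * (nbins - 1)).long()
hist = torch.bincount(bins, minlength=nbins).float()
log_hist = torch.log1p(hist)   # fed into DRMM's feed-forward matching network
```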
- XPM Configxpmir.neural.interaction.colbert.Colbert(*, encoder, query_encoder, similarity, qlen, dlen, linear_dim, compression_size)[source]
Bases:
InteractionScorer
Submit type:
xpmir.neural.interaction.colbert.Colbert
ColBERT model
Implementation of the Colbert model from:
Khattab, Omar, and Matei Zaharia. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020, Xi’An, China
For the standard Colbert model, use the colbert function
- encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model – the vocab also defines how to tokenize text
- query_encoder: xpmir.text.encoders.TokenizedTextEncoderBase[str, xpmir.text.encoders.TokensEncoderOutput]
The embedding model for queries (if None, uses encoder)
- similarity: xpmir.neural.interaction.common.Similarity
Which similarity function to use - ColBERT uses a cosine similarity by default
- qlen: int = 20
Maximum query length (this can even be shortened by the model)
- dlen: int = 2000
Maximum document length (this can even be shortened by the model)
- version: int = 2 (constant)
Current version of the code (changes if a bug is found)
- linear_dim: int = 128
Size of the last linear layer (before computing inner products)
- compression_size: int = 128
Projection layer for the last layer (or 0 if None)
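A sketch of the late-interaction (MaxSim) scoring that ColBERT performs on top of the token-level similarity map; padding masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Token representations after the final linear projection (linear_dim = 128)
q = F.normalize(torch.randn(1, 20, 128), dim=-1)    # (batch, qlen, dim)
d = F.normalize(torch.randn(1, 180, 128), dim=-1)   # (batch, dlen, dim)

sim = q @ d.transpose(1, 2)                  # (batch, qlen, dlen) cosine map
score = sim.max(dim=-1).values.sum(dim=-1)   # max over document tokens, sum over query tokens
```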
DRMM
- XPM Configxpmir.neural.interaction.drmm.Combination[source]
Bases: Config, TorchModule
Submit type: xpmir.neural.interaction.drmm.Combination
- XPM Configxpmir.neural.interaction.drmm.CountHistogram(*, nbins)[source]
Bases: Config, TorchModule
Submit type: xpmir.neural.interaction.drmm.CountHistogram
Base histogram class
- nbins: int = 29
number of bins in matching histogram
- XPM Configxpmir.neural.interaction.drmm.IdfCombination[source]
Bases:
Combination
Submit type:
xpmir.neural.interaction.drmm.IdfCombination
- XPM Configxpmir.neural.interaction.drmm.LogCountHistogram(*, nbins)[source]
Bases:
CountHistogram
Submit type:
xpmir.neural.interaction.drmm.LogCountHistogram
- nbins: int = 29
number of bins in matching histogram
- XPM Configxpmir.neural.interaction.drmm.NormalizedHistogram(*, nbins)[source]
Bases:
CountHistogram
Submit type:
xpmir.neural.interaction.drmm.NormalizedHistogram
- nbins: int = 29
number of bins in matching histogram
- XPM Configxpmir.neural.interaction.drmm.SumCombination[source]
Bases:
Combination
Submit type:
xpmir.neural.interaction.drmm.SumCombination
Similarity
- XPM Configxpmir.neural.interaction.common.Similarity[source]
Bases: Config, ABC
Submit type: xpmir.neural.interaction.common.Similarity
Base class for similarity between two text representations (3D tensors: batch x length x dim)
- XPM Configxpmir.neural.interaction.common.DotProductSimilarity[source]
Bases:
Similarity
Submit type:
xpmir.neural.interaction.common.DotProductSimilarity
- XPM Configxpmir.neural.interaction.common.CosineSimilarity[source]
Bases:
DotProductSimilarity
Submit type:
xpmir.neural.interaction.common.CosineSimilarity
Cosine similarity between two text representations (3D tensors: batch x length x dim)
- class xpmir.neural.interaction.common.SimilarityInput(value: torch.Tensor, mask: torch.BoolTensor)[source]
Bases: Sequence[SimilarityInput]
Sparse Models
- XPM Configxpmir.neural.splade.SpladeTextEncoder(*, encoder, aggregation, maxlen)[source]
Bases: TextEncoder, DistributableModel
Submit type: xpmir.neural.splade.SpladeTextEncoder
Splade model
It is only a text encoder, since xpmir.neural.dual.DotDense is used as the scorer class
- encoder: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput
The encoder from Hugging Face
- aggregation: xpmir.neural.splade.Aggregation
How to aggregate the vectors
- maxlen: int
Max length for texts
- XPM Configxpmir.neural.splade.SpladeTextEncoderV2(*, tokenizer, encoder, aggregation, maxlen)[source]
Bases: TextEncoderBase[InputType, TextsRepresentationOutput], DistributableModel, Generic[InputType]
Submit type: xpmir.neural.splade.SpladeTextEncoderV2
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizerBase[InputType]
The tokenizer from Hugging Face
- encoder: xpmir.text.huggingface.base.HFMaskedLanguageModel
The encoder from Hugging Face
- aggregation: xpmir.neural.splade.Aggregation
How to aggregate the vectors
- maxlen: int
Max length for texts
- XPM Configxpmir.neural.splade.Aggregation[source]
Bases:
Config
Submit type:
xpmir.neural.splade.Aggregation
The aggregation function for Splade
- XPM Configxpmir.neural.splade.MaxAggregation[source]
Bases:
Aggregation
Submit type:
xpmir.neural.splade.MaxAggregation
Aggregate using a max
- XPM Configxpmir.neural.splade.SumAggregation[source]
Bases:
Aggregation
Submit type:
xpmir.neural.splade.SumAggregation
Aggregate using a sum
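A sketch of how the two aggregations pool token-level vocabulary weights into a single sparse vector; whether the log(1 + ReLU(.)) saturation used by SPLADE is applied inside the aggregation or by the encoder is an assumption here.

```python
import torch

logits = torch.randn(2, 30, 30522)            # (batch, length, vocab) MLM logits
weights = torch.log1p(torch.relu(logits))     # SPLADE-style saturation

max_agg = weights.max(dim=1).values           # MaxAggregation-style: (batch, vocab)
sum_agg = weights.sum(dim=1)                  # SumAggregation-style: (batch, vocab)
```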
Generative Models
- XPM Configxpmir.neural.generative.ConditionalGenerator[source]
Bases:
Module
Submit type:
xpmir.neural.generative.ConditionalGenerator
Models that generate an identifier given a document or a query
- XPM Configxpmir.neural.generative.cross.GenerativeCrossScorer(*, pattern, generator, relevant_token_id)[source]
Bases:
LearnableScorer
Submit type:
xpmir.neural.generative.cross.GenerativeCrossScorer
A cross-encoder based on a generative model
- version: int = 2 (constant)
Generative cross scorer version. Changelog: 1. corrects the output type (probability)
- pattern: str = Query: {query} Document: {document} Relevant:
- relevant_token_id: int
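A plain HuggingFace sketch of the monoT5-style scoring this class describes: the pattern is filled with the query and document, and the score is the probability of a designated "relevant" token at the first decoding step. The checkpoint and the choice of "true" as the relevant token are only examples; the xpmir generator classes are not used here.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = "Query: what is splade Document: SPLADE is a sparse neural retriever. Relevant:"
inputs = tokenizer(prompt, return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
relevant_token_id = tokenizer("true", add_special_tokens=False).input_ids[0]  # monoT5 convention
score = torch.softmax(logits, dim=-1)[relevant_token_id]
```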
HuggingFace Generative Models
- XPM Configxpmir.neural.generative.hf.LoadFromT5(*, t5_model)[source]
Bases:
LightweightTask
Submit type:
xpmir.neural.generative.hf.LoadFromT5
Load parameters from a T5 model
- t5_model: xpmir.neural.generative.hf.T5ConditionalGenerator
the target
- XPM Configxpmir.neural.generative.hf.T5IdentifierGenerator(*, hf_id, decoder_outdim)[source]
Bases:
T5ConditionalGenerator
Submit type:
xpmir.neural.generative.hf.T5IdentifierGenerator
Generates token identifiers using T5-based models
- hf_id: str
The HuggingFace identifier (to configure the model)
- decoder_outdim: int = 10
The decoder output dimension for the T5 model; used to rebuild the lm_head and the decoder embeddings. This number does not include the pad and EOS tokens.
- XPM Configxpmir.neural.generative.hf.T5ConditionalGenerator(*, hf_id)[source]
Bases: ConditionalGenerator, DistributableModel
Submit type: xpmir.neural.generative.hf.T5ConditionalGenerator
- hf_id: str
The HuggingFace identifier (to configure the model)
- XPM Configxpmir.neural.generative.hf.T5CustomOutputGenerator(*, hf_id, tokens)[source]
Bases:
T5ConditionalGenerator
Submit type:
xpmir.neural.generative.hf.T5CustomOutputGenerator
Generates token identifiers using T5-based models
- hf_id: str
The HuggingFace identifier (to configure the model)
- tokens: List[str]
From Huggingface
- XPM Configxpmir.neural.huggingface.HFCrossScorer(*, hf_id, max_length)[source]
Bases: LearnableScorer, DistributableModel
Submit type: xpmir.neural.huggingface.HFCrossScorer
Load a cross-scorer model from HuggingFace
- hf_id: str
The identifier of the HuggingFace model
- max_length: int
The maximum input length for the transformer model
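A configuration sketch using the documented parameters; the checkpoint identifier is just an example of a HuggingFace cross-encoder.

```python
from xpmir.neural.huggingface import HFCrossScorer

scorer = HFCrossScorer(
    hf_id="cross-encoder/ms-marco-MiniLM-L-6-v2",
    max_length=512,
)
```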