Neural models
Cross-Encoder
Models that rely on a joint representation of the query and the document.
- XPM Configxpmir.neural.cross.CrossScorer(*, checkpoint, encoder)
Bases:
xpmir.neural.TorchLearnableScorer
Query-Document Representation Classifier
Based on a joint query-document representation (e.g. the BERT [CLS] token). Also known as a cross-encoder (see the sketch after this entry).
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- encoder: xpmir.text.encoders.DualTextEncoder
Document (and query) encoder
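A cross-encoder scores one (query, document) pair per forward pass. The following is a minimal, self-contained sketch of the idea in plain PyTorch (a toy stand-in, not the xpmir implementation): the two texts are encoded jointly, and the score is a linear function of a [CLS]-like summary vector.

import torch
import torch.nn as nn

class TinyCrossEncoder(nn.Module):
    # Toy joint encoder: query and document tokens are concatenated and
    # encoded together; the first position acts as a [CLS]-like summary.
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.score = nn.Linear(dim, 1)

    def forward(self, query_ids, doc_ids):
        pair = torch.cat([query_ids, doc_ids], dim=1)  # joint input
        hidden = self.encoder(self.embed(pair))
        return self.score(hidden[:, 0]).squeeze(-1)    # one score per pair

scores = TinyCrossEncoder()(torch.randint(0, 1000, (2, 8)),
                            torch.randint(0, 1000, (2, 32)))
print(scores.shape)  # torch.Size([2])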
Dense models
- XPM Configxpmir.neural.DualRepresentationScorer(*, checkpoint)
Bases:
xpmir.neural.TorchLearnableScorer
Neural scorer based on an (at least partially) independent representation of the document and the query.
This is the base class for all scorers that depend on a map of cosine/inner products between query and document tokens.
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- score_pairs(queries, documents, info: Optional[xpmir.letor.context.TrainerContext]) → torch.Tensor
Score the specified pairs of queries/documents.
There are as many queries as documents. The exact type of queries and documents depends on the specific instance of the dual representation scorer.
- Parameters
queries (Any) – The list of encoded queries
documents (Any) – The matching list of encoded documents
info (Optional[TrainerContext]) – The training context (if learning)
- Returns
A tensor of dimension (N) where N is the number of documents/queries
- Return type
torch.Tensor
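To make this contract concrete, here is a tensor-level sketch (plain PyTorch, not the library code) for a scorer whose representations are single dense vectors: N queries aligned one-to-one with N documents produce a score vector of dimension (N).

import torch

N, dim = 4, 16
queries = torch.randn(N, dim)    # one encoded query per row
documents = torch.randn(N, dim)  # the matching encoded document per row

# One inner-product score per aligned (query, document) pair -> (N,)
pair_scores = (queries * documents).sum(dim=-1)
print(pair_scores.shape)  # torch.Size([4])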
- score_product(queries, documents, info: Optional[xpmir.letor.context.TrainerContext])
Computes the score of all possible pairs of query and document
- Parameters
queries (Any) – The encoded queries
documents (Any) – The encoded documents
info (Optional[TrainerContext]) – The training context (if learning)
- Returns
A tensor of dimension (N, P) where N is the number of queries and P the number of documents
- Return type
torch.Tensor
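With an inner-product scorer and dense vector representations, this all-pairs scoring is a single matrix multiplication; a sketch under the same assumptions as above:

import torch

N, P, dim = 4, 7, 16
queries = torch.randn(N, dim)
documents = torch.randn(P, dim)

# Every query scored against every document -> (N, P)
all_scores = queries @ documents.T
print(all_scores.shape)  # torch.Size([4, 7])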
- XPM Configxpmir.neural.dual.Dense(*, checkpoint, encoder, query_encoder)
Bases:
xpmir.neural.dual.DualVectorScorer
A scorer based on a pair of (query, document) dense vectors
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- encoder: xpmir.text.encoders.TextEncoder
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoder
The query encoder (optional; if not defined, the document encoder is used)
- XPM Configxpmir.neural.dual.DotDense(*, checkpoint, encoder, query_encoder)
Bases:
xpmir.neural.dual.Dense
Dual model based on inner product.
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- encoder: xpmir.text.encoders.TextEncoder
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoder
The query encoder (optional; if not defined, the document encoder is used)
- XPM Configxpmir.neural.dual.CosineDense(*, checkpoint, encoder, query_encoder)
Bases:
xpmir.neural.dual.Dense
Dual model based on cosine similarity.
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- encoder: xpmir.text.encoders.TextEncoder
The document (and potentially query) encoder
- query_encoder: xpmir.text.encoders.TextEncoder
The query encoder (optional; if not defined, the document encoder is used)
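The only difference between the two Dense variants is the similarity function; a sketch in plain PyTorch, where the random tensors stand in for encoder outputs:

import torch
import torch.nn.functional as F

q = torch.randn(3, 16)  # dense query vectors (stand-in for the encoder)
d = torch.randn(5, 16)  # dense document vectors

dot_scores = q @ d.T  # what DotDense computes
cos_scores = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T  # CosineDense
print(dot_scores.shape, cos_scores.shape)  # both torch.Size([3, 5])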
Interaction models
- xpmir.neural.interaction.InteractionScorer – Interaction-based neural scorer
- xpmir.neural.interaction.drmm.Drmm – Deep Relevance Matching Model (DRMM)
- xpmir.neural.colbert.Colbert – ColBERT model
- XPM Configxpmir.neural.interaction.InteractionScorer(*, checkpoint, vocab, qlen, dlen)
Bases:
xpmir.neural.TorchLearnableScorer
Interaction-based neural scorer
This is the base class for all scorers that depend on a map of cosine/inner products between query and document token representations.
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- vocab: xpmir.text.Vocab
The embedding model – the vocab also defines how to tokenize text
- qlen: int = 20
Maximum query length (the model may shorten it further)
- dlen: int = 2000
Maximum document length (the model may shorten it further)
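The similarity map mentioned above is a (query length, document length) matrix with one entry per pair of token representations; a minimal sketch, independent of xpmir:

import torch
import torch.nn.functional as F

qlen, dlen, dim = 20, 2000, 300  # matching the qlen/dlen defaults above
q_tokens = torch.randn(qlen, dim)  # query token embeddings
d_tokens = torch.randn(dlen, dim)  # document token embeddings

# Cosine interaction matrix: one similarity per (query, document) token pair
interactions = F.normalize(q_tokens, dim=-1) @ F.normalize(d_tokens, dim=-1).T
print(interactions.shape)  # torch.Size([20, 2000])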
- XPM Configxpmir.neural.interaction.drmm.Drmm(*, checkpoint, vocab, qlen, dlen, hist, hidden, index, combine)
Bases:
xpmir.neural.interaction.InteractionScorer
Deep Relevance Matching Model (DRMM)
Implementation of the DRMM model from:
Jiafeng Guo, Yixing Fan, Qingyao Ai, and William Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM.
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- vocab: xpmir.text.Vocab
The embedding model – the vocab also defines how to tokenize text
- qlen: int = 20
Maximum query length (the model may shorten it further)
- dlen: int = 2000
Maximum document length (the model may shorten it further)
- hist: xpmir.neural.interaction.drmm.CountHistogram = xpmir.neural.interaction.drmm.LogCountHistogram(nbins=29)
The histogram type
- hidden: int
Hidden layer dimension for the feed-forward matching network
- index: datamaestro_text.data.ir.AdhocIndex
The index (only used when using IDF to combine)
- combine: xpmir.neural.interaction.drmm.Combination = xpmir.neural.interaction.drmm.IdfCombination()
How to combine the query term scores
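DRMM summarizes each query term's row of the interaction matrix as a matching histogram before the feed-forward network; a sketch of the log-count variant following the paper (not the xpmir internals), with the bin count mirroring the nbins=29 default above:

import torch

def log_count_histogram(similarities, nbins=29):
    # similarities: (qlen, dlen) matrix of cosine similarities in [-1, 1];
    # bucket each query term's row into nbins bins and take log(1 + count)
    hist = torch.stack([
        torch.histc(row, bins=nbins, min=-1.0, max=1.0)
        for row in similarities
    ])
    return torch.log1p(hist)  # (qlen, nbins)

sims = torch.rand(20, 2000) * 2 - 1  # fake cosine similarities
print(log_count_histogram(sims).shape)  # torch.Size([20, 29])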
- XPM Configxpmir.neural.colbert.Colbert(*, checkpoint, vocab, qlen, dlen, masktoken, querytoken, doctoken, similarity, linear_dim, compression_size)
Bases:
xpmir.neural.interaction.InteractionScorer
ColBERT model
Implementation of the Colbert model from:
Khattab, Omar, and Matei Zaharia. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020, Xi’An, China
For the standard ColBERT model, use BERT as the vocab(ulary)
- checkpoint: Path
A checkpoint path from which the model should be loaded (or None otherwise)
- vocab: xpmir.text.Vocab
The embedding model – the vocab also defines how to tokenize text
- qlen: int = 20
Maximum query length (the model may shorten it further)
- dlen: int = 2000
Maximum document length (the model may shorten it further)
- version: int = 2 (constant)
Current version of the code (changes when a bug is found)
- masktoken: bool = True
Whether a [MASK] token should be used instead of padding
- querytoken: bool = True
Whether a specific query token should be used as a prefix to the question
- doctoken: bool = True
Whether a specific document token should be used as a prefix to the document
- similarity: xpmir.neural.colbert.Similarity = xpmir.neural.colbert.CosineDistance()
Which similarity to use
- linear_dim: int = 128
Size of the last linear layer (before computing inner products)
- compression_size: int = 128
Size of the projection applied to the last layer (or 0 for no projection)
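ColBERT's late interaction is the MaxSim operator: each query token keeps the similarity of its best-matching document token, and these maxima are summed; a compact sketch (plain PyTorch, not the xpmir implementation), with the vector size matching the linear_dim=128 default:

import torch
import torch.nn.functional as F

qlen, dlen, dim = 20, 180, 128
q = F.normalize(torch.randn(qlen, dim), dim=-1)  # projected query tokens
d = F.normalize(torch.randn(dlen, dim), dim=-1)  # projected document tokens

# MaxSim: best document token per query token, summed over the query
score = (q @ d.T).max(dim=-1).values.sum()
print(score.item())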
Sparse Models
- XPM Configxpmir.neural.splade.SpladeTextEncoder(*, encoder, aggregation, maxlen)
Bases:
xpmir.text.encoders.TextEncoder
Splade model
It is only a text encoder, since scoring is handled by xpmir.neural.dual.DotDense
- encoder: xpmir.text.huggingface.TransformerVocab
The encoder from Hugging Face
- aggregation: xpmir.neural.splade.Aggregation
How to aggregate the vectors
- maxlen: int
Max length for texts
- XPM Configxpmir.neural.splade.Aggregation
Bases:
experimaestro.core.objects.Config
The aggregation function for Splade
- XPM Configxpmir.neural.splade.MaxAggregation
Bases:
xpmir.neural.splade.Aggregation
Aggregate using a max
- XPM Configxpmir.neural.splade.SumAggregation
Bases:
xpmir.neural.splade.Aggregation
Aggregate using a sum
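Both aggregations reduce per-token vocabulary activations to a single sparse vector; in the SPLADE papers the activation is log(1 + ReLU(logits)) over the masked-language-model logits, reduced over the token axis with a max or a sum. A sketch following the papers rather than the xpmir code:

import torch

tokens, vocab_size = 32, 30522  # 30522 = BERT vocabulary size
logits = torch.randn(tokens, vocab_size)  # MLM logits for one text

activated = torch.log1p(torch.relu(logits))  # sparsity-friendly activation
max_repr = activated.max(dim=0).values  # MaxAggregation-style reduction
sum_repr = activated.sum(dim=0)         # SumAggregation-style reduction
print(max_repr.shape, sum_repr.shape)   # both torch.Size([30522])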
Pretrained models
- xpmir.neural.pretrained.spladev2() → xpmir.neural.dual.DotDense
The Splade V2 model (from https://github.com/naver/splade)
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, Thibault Formal, Benjamin Piwowarski, Carlos Lassance, and Stéphane Clinchant.
- xpmir.neural.pretrained.tas_balanced() → xpmir.neural.dual.DotDense
Returns the TAS-Balanced model (from huggingface)
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling, Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury.
- Returns
A DotDense ranker based on tas-balanced
- Return type
xpmir.neural.dual.DotDense
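Both factories return a DotDense configuration, so they can be used anywhere a dense dual-encoder scorer is expected; a minimal usage sketch (only the two documented functions are assumed):

from xpmir.neural.pretrained import spladev2, tas_balanced

splade = spladev2()   # DotDense built on the SPLADE v2 weights
tas = tas_balanced()  # DotDense built on the tas-balanced weights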