Pairwise learning

In pairwise learning, each training instance is a (query, positive document, negative document) triplet. The model learns to rank the positive document above the negative one, typically optimised with a margin-based or cross-entropy loss.
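The two loss families mentioned above can be illustrated on a single pair of scores. This is a minimal, library-independent sketch: the function names `hinge_loss` and `pairwise_cross_entropy` are hypothetical and are not part of xpmir or xpm_torch, whose actual loss classes may differ in detail.

```python
import math


def hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Margin-based loss: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


def pairwise_cross_entropy(pos_score: float, neg_score: float) -> float:
    """Cross-entropy on P(pos ranked above neg) = sigmoid(pos_score - neg_score).

    This equals softplus(neg_score - pos_score) = log(1 + exp(neg - pos)).
    """
    diff = neg_score - pos_score
    # Numerically stable softplus: avoids overflow for large |diff|
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Illustrative scores only (not taken from any real model)
print(hinge_loss(2.0, 0.5))
print(pairwise_cross_entropy(2.0, 0.5))
```

Both losses shrink as the gap between the positive and negative scores grows; the hinge loss becomes exactly zero past the margin, while the cross-entropy variant only approaches zero.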

Trainer

XPM Config xpmir.letor.trainers.pairwise.PairwiseTrainer(*, hooks, model, sampler, batch_size, num_workers, lossfn)

Bases: LossTrainer

Pairwise trainer uses samples of the form (query, positive, negative)

hooks: List[xpm_torch.trainers.context.TrainingHook] = []

Hooks for this trainer: this includes the losses, but can be adapted for other uses. The specific list of hooks depends on the specific trainer.

model: xpm_torch.module.Module

If the model to optimize is different from the model passed to Learn, this parameter can be used; initialization is still expected to be done at the learner level

batcher: xpm_torch.batchers.Batcher (generated)

How to batch samples together

sampler: xpm_torch.base.Sampler

The sampler to use

batch_size: int = 16

Number of samples per batch

num_workers: int = 2

Number of DataLoader workers

lossfn: xpm_torch.losses.pairwise.PairwiseLoss

The loss function

Samplers

Samplers produce pairwise training triplets from different data sources (model scores, pre-computed files, or in-batch negatives).

XPM Config xpmir.letor.samplers.PairwiseModelBasedSampler(*, dataset, retriever)

Bases: ModelBasedSampler, Sampler[PairwiseItem]

A pairwise sampler based on a retrieval model

dataset: datamaestro_ir.data.Adhoc

The IR adhoc dataset

retriever: xpmir.rankers.retriever.Retriever

A retriever to sample negative documents

XPM Config xpmir.letor.samplers.TripletBasedSampler(*, source)

Bases: Sampler[PairwiseItem]

Sampler based on a triplet source

source: datamaestro_ir.data.TrainingTriplets

Triplets

XPM Config xpmir.letor.samplers.PairwiseDatasetTripletBasedSampler(*, documents, dataset, negative_algo)

Bases: Sampler[PairwiseItem]

Sampler based on a dataset where each query is associated with (1) a set of relevant documents and (2) a set of negative documents, each negative being sampled with a specific algorithm

documents: datamaestro_ir.data.DocumentStore

The document store

dataset: datamaestro_ir.data.PairwiseSampleDataset

The dataset which contains the generated queries with their positives and negatives

negative_algo: str = random

The algorithm used to sample the negatives (default: random)

XPM Config xpmir.letor.samplers.PairwiseInBatchNegativesSampler(*, sampler)

Bases: Sampler[BatchwiseItems]

An in-batch negative sampler constructed from a pairwise one

sampler: xpm_torch.base.Sampler[xpmir.letor.records.PairwiseItem]

The base pairwise sampler
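The general idea of in-batch negatives can be sketched as follows: within a batch of triplets, the documents attached to the other queries are reused as additional negatives for each query. The function `in_batch_negatives` and the returned layout are assumptions made for illustration; they are not the actual structure of BatchwiseItems in xpmir.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (query, positive, negative)


def in_batch_negatives(
    batch: List[Triplet],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn pairwise triplets into a batch with a shared document pool.

    Hypothetical layout: every query is scored against every document in
    the pool; only its own positive is labelled as relevant.
    """
    queries = [q for q, _, _ in batch]
    # Shared pool: all positives first, then all explicit negatives
    documents = [p for _, p, _ in batch] + [n for _, _, n in batch]
    # relevance[i][j] == 1 iff documents[j] is the positive of queries[i]
    relevance = [
        [1 if j == i else 0 for j in range(len(documents))]
        for i in range(len(queries))
    ]
    return queries, documents, relevance


queries, documents, relevance = in_batch_negatives(
    [("q1", "p1", "n1"), ("q2", "p2", "n2")]
)
```

With this scheme, a batch of B triplets gives each query 2B - 1 negatives instead of one. A known caveat of the technique is that another query's positive may in fact be relevant to the current query, which introduces false negatives.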

XPM Config xpmir.letor.samplers.PairwiseSamplerFromTSV(*, pairwise_samples_path)

Bases: Sampler[PairwiseItem]

pairwise_samples_path: path

The path which stores the existing triplets

XPM Task xpmir.letor.samplers.ModelBasedHardNegativeSampler(*, dataset, retriever)

Bases: Task, Sampler

Submit type: datamaestro_ir.data.PairwiseSampleDataset

Retriever-based hard negative sampler

dataset: datamaestro_ir.data.Adhoc

The dataset which contains the topics and assessments

retriever: xpmir.rankers.retriever.Retriever

The retriever used to score documents with respect to the query

hard_negative_samples: path (generated)

Path to store the generated hard negatives
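Retriever-based hard negative mining is commonly done by keeping the highest-scored documents that are not assessed as relevant. The sketch below shows that general idea only; `hard_negatives`, its arguments, and the cut-off `k` are hypothetical, and the selection rule actually used by ModelBasedHardNegativeSampler is not documented here.

```python
from typing import List, Set, Tuple


def hard_negatives(
    ranking: List[Tuple[str, float]], relevant_ids: Set[str], k: int = 2
) -> List[str]:
    """Return the k best-scored document IDs that are not known to be relevant.

    `ranking` holds (doc_id, score) pairs as a retriever might return them.
    """
    ranked = sorted(ranking, key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked if doc_id not in relevant_ids][:k]


# Illustrative ranking for one query; "d1" is the assessed relevant document
ranking = [("d3", 1.0), ("d1", 3.0), ("d4", 0.2), ("d2", 2.5)]
print(hard_negatives(ranking, relevant_ids={"d1"}))
```

Such negatives are "hard" because the retriever already considers them good matches, so they are more informative for training than randomly drawn documents.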

Dataset types

Pre-computed pairwise datasets stored as JSONL or TSV files.

XPM Config xpmir.letor.samplers.JSONLPairwiseSampleDataset(*, id, path)

Bases: PairwiseSampleDataset

Transform a JSONL file to a pairwise dataset.

General format:

{
    "queries": ["str", "str"],
    "pos_ids": ["id", "id"],
    "neg_ids": {
        "bm25": ["id", "id"],
        "random": ["id", "id"]
    }
}

id: str

The unique (sub-)dataset ID

path: path

The path to the JSONL file
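To make the general format above concrete, the following sketch expands one JSONL record into (query, positive ID, negative ID) triplets. The field names come from the format shown above; everything else is an assumption for illustration: the helper `triplets_from_record` is hypothetical, and pairing every query with every positive and drawing one negative from the list keyed by the chosen algorithm is only one plausible reading, not necessarily what xpmir does.

```python
import json
import random
from typing import List, Optional, Tuple


def triplets_from_record(
    line: str, negative_algo: str = "random", rng: Optional[random.Random] = None
) -> List[Tuple[str, str, str]]:
    """Expand one JSONL line into (query, pos_id, neg_id) triplets."""
    rng = rng or random.Random()
    record = json.loads(line)
    # Assumption: the algorithm name selects one of the lists under "neg_ids"
    negatives = record["neg_ids"][negative_algo]
    return [
        (query, pos_id, rng.choice(negatives))
        for query in record["queries"]
        for pos_id in record["pos_ids"]
    ]


# Made-up record following the general format
line = json.dumps(
    {
        "queries": ["first query", "second query"],
        "pos_ids": ["d1"],
        "neg_ids": {"bm25": ["d9"], "random": ["d7"]},
    }
)
print(triplets_from_record(line, negative_algo="bm25"))
```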

XPM Config xpmir.letor.samplers.TSVPairwiseSampleDataset(*, id, hard_negative_samples_path)

Bases: PairwiseSampleDataset

Read the pairwise sample dataset from a TSV file

id: str

The unique (sub-)dataset ID

hard_negative_samples_path: path

The path which stores the existing IDs

Adapters

XPM Config xpmir.letor.samplers.adapters.SamplerAdapter(*, sampler, processors, buffer_size)

Bases: Sampler[SampleT]

Wraps a sampler with processors that transform its output.

The adapter takes an input Sampler and applies a chain of RecordsProcessors to transform the samples.
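The chaining behaviour described above can be pictured with a small generator. This is a conceptual sketch only: `adapt` and the `Processor` type are hypothetical stand-ins, samples are plain dictionaries rather than xpmir records, and the role of `buffer_size` is not modelled.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for a RecordsProcessor: maps one sample to another
Processor = Callable[[dict], dict]


def adapt(samples: Iterable[dict], processors: List[Processor]) -> Iterator[dict]:
    """Yield each sample after applying the processors in order."""
    for sample in samples:
        for process in processors:
            sample = process(sample)
        yield sample


def lowercase_query(sample: dict) -> dict:
    return {**sample, "query": sample["query"].lower()}


def add_length(sample: dict) -> dict:
    return {**sample, "length": len(sample["query"].split())}


for item in adapt([{"query": "Hello World"}], [lowercase_query, add_length]):
    print(item)
```

Because the output of one processor is the input of the next, the order of the list matters: here `add_length` sees the already lowercased query.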

sampler: xpm_torch.base.Sampler

processors: List[xpmir.letor.processors.RecordsProcessor]

buffer_size: int = 64

Processors

XPM Config xpmir.letor.processors.StoreHydrator(*, documentstore, querystore)

Bases: DocumentsProcessor[DocIn, QueryIn, DocOut], QueriesProcessor[DocIn, QueryIn, QueryOut]

Hydrates ID-only records with text from document/query stores.

When documentstore is set, documents are hydrated via documents_ext(). When querystore is set, queries are hydrated via store lookup. For documents with ScoredItem, the score is preserved via ScoredDocument.

documentstore: datamaestro_ir.data.DocumentStore

querystore: xpmir.datasets.adapters.TextStore
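Hydration, as described for StoreHydrator, amounts to looking up the text behind each ID while leaving other fields (such as a score) untouched. The sketch below mimics this with plain dictionaries; the `hydrate` function, the record layout, and the dictionary-based stores are all assumptions for illustration and do not reflect the real DocumentStore or TextStore interfaces.

```python
from typing import Dict, Optional


def hydrate(
    record: dict,
    documentstore: Optional[Dict[str, str]] = None,
    querystore: Optional[Dict[str, str]] = None,
) -> dict:
    """Return a copy of `record` with text added to ID-only entries.

    A store left as None means the corresponding entry is not hydrated.
    """
    out = dict(record)
    if documentstore is not None:
        doc = record["document"]
        # Existing fields, including any score, are kept as they are
        out["document"] = {**doc, "text": documentstore[doc["id"]]}
    if querystore is not None:
        query = record["query"]
        out["query"] = {**query, "text": querystore[query["id"]]}
    return out


record = {"query": {"id": "q1"}, "document": {"id": "d1", "score": 0.8}}
print(hydrate(record, documentstore={"d1": "document text"}))
```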