Experimaestro-IR datasets

Besides the ir-datasets interface, Experimaestro-IR provides the datasets listed below.

Dataset co.huggingface.datasets.sentence-transformers.msmarco-hard-negatives.ensemble

datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset

Hard negatives mined from a set of models

Tags: hard negatives, information retrieval, msmarco

Tasks: learning to rank

External link: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives

Dataset com.github.sebastian-hofstaetter.neural-ranking-kd.msmarco.ensemble.teacher

xpmir.letor.distillation.samplers.PairwiseDistillationSamplesTSV

Training files without the text content; they use the ids from MSMARCO instead

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

The teacher files, built from the “Train Triples Small” data (~40 million triples). Each line is tab-separated and has the format pos_score neg_score query_id pos_passage_id neg_passage_id
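The line format described above can be read with the Python standard library alone. The following is a minimal sketch, not the parser used by xpmir: the `TeacherSample` container and the `read_teacher_samples` function are hypothetical names introduced for illustration, the sample values are made up, and ids are kept as strings on the assumption that they are opaque identifiers.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class TeacherSample(NamedTuple):
    """One line of a teacher file (hypothetical container, not part of xpmir)."""

    pos_score: float
    neg_score: float
    query_id: str
    pos_passage_id: str
    neg_passage_id: str


def read_teacher_samples(lines: Iterable[str]) -> Iterator[TeacherSample]:
    """Lazily parse tab-separated teacher lines (hypothetical helper)."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        pos_score, neg_score, query_id, pos_id, neg_id = row
        yield TeacherSample(
            float(pos_score), float(neg_score), query_id, pos_id, neg_id
        )


# Made-up values, for illustration only
example = io.StringIO("12.5\t3.25\t1000\t2000\t3000\n")
samples = list(read_teacher_samples(example))
```

Because the function is a generator, a real file with tens of millions of lines can be passed as an open file handle and processed one line at a time without loading it into memory.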

Dataset com.github.sebastian-hofstaetter.neural-ranking-kd.msmarco.bert.teacher

xpmir.letor.distillation.samplers.PairwiseDistillationSamplesTSV

Training files without the text content; they use the ids from MSMARCO instead

External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd

The teacher files, built from the “Train Triples Small” data (~40 million triples). Each line is tab-separated and has the format pos_score neg_score query_id pos_passage_id neg_passage_id

Pre-computed Anserini indices provided by Jimmy Lin (U. Waterloo)

Dataset ca.uwaterloo.jimmylin.anserini.robust04

xpmir.index.anserini.Index

Robust 2004 index

Pre-computed Anserini index of the Robust 2004 collection; the parameters used are listed at https://git.uwaterloo.ca/jimmylin/anserini-indexes/-/blob/master/index-robust04-20191213-readme.txt