Experimaestro-IR datasets
Besides the ir-datasets interface, Experimaestro-IR provides the datasets listed below.
-
Dataset co.huggingface.datasets.sentence-transformers.msmarco-hard-negatives.ensemble
datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset
Hard negatives mined from a set of models
Tags: information retrieval, hard negatives, msmarco
Tasks: learning to rank
External link: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives
-
Dataset com.github.sebastian-hofstaetter.neural-ranking-kd.msmarco.ensemble.teacher
xpmir.letor.distillation.samplers.PairwiseDistillationSamplesTSV
Training files that omit the text content and reference MSMARCO ids instead
External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd
The teacher files (built from the “Train Triples Small” data, ~40 million triples) are tab-separated, with the columns pos_score, neg_score, query_id, pos_passage_id, neg_passage_id
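The tab-separated layout of the teacher files can be illustrated with a minimal parsing sketch. It relies only on the Python standard library and does not use the xpmir sampler itself. The names `TeacherSample` and `read_teacher_samples` are hypothetical helpers introduced for illustration, and the sample line contains made-up values.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class TeacherSample(NamedTuple):
    """One line of a teacher file (hypothetical helper, not part of xpmir)."""

    pos_score: float
    neg_score: float
    query_id: str
    pos_passage_id: str
    neg_passage_id: str


def read_teacher_samples(lines: Iterable[str]) -> Iterator[TeacherSample]:
    """Yield one sample per non-empty tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Column order: pos_score, neg_score, query_id, pos_passage_id, neg_passage_id
        pos_score, neg_score, query_id, pos_id, neg_id = line.split("\t")
        yield TeacherSample(float(pos_score), float(neg_score), query_id, pos_id, neg_id)


# Made-up line standing in for a real teacher file
example = io.StringIO("12.5\t3.25\t42\t1001\t2002\n")
samples = list(read_teacher_samples(example))
for sample in samples:
    # The ids refer to MSMARCO queries and passages; the text must be looked up separately
    print(sample.query_id, sample.pos_score - sample.neg_score)
```

The ids are kept as strings here because they are only used as lookup keys into the MSMARCO collection.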
-
Dataset com.github.sebastian-hofstaetter.neural-ranking-kd.msmarco.bert.teacher
xpmir.letor.distillation.samplers.PairwiseDistillationSamplesTSV
Training files that omit the text content and reference MSMARCO ids instead
External link: https://github.com/sebastian-hofstaetter/neural-ranking-kd
The teacher files (built from the “Train Triples Small” data, ~40 million triples) are tab-separated, with the columns pos_score, neg_score, query_id, pos_passage_id, neg_passage_id
Pre-computed Anserini indices provided by Jimmy Lin (U. Waterloo)
-
Dataset ca.uwaterloo.jimmylin.anserini.robust04
-
Robust 2004 index
Pre-computed Anserini index of the Robust 2004 collection; the indexing parameters are listed at https://git.uwaterloo.ca/jimmylin/anserini-indexes/-/blob/master/index-robust04-20191213-readme.txt