Dataset adapters

Adapters derive new datasets from existing ones by subsampling documents, topics, or relevance assessments. They are useful for creating train/validation splits, restricting evaluation to a subset of topics, or building retriever-based collections for re-ranking experiments.

Adhoc datasets 

Split or combine ad-hoc retrieval datasets into folds.

XPM Task xpmir.datasets.adapters.RandomFold(*, seed, sizes, dataset, fold, exclude)

Bases: Task

Submit type: datamaestro_ir.data.Adhoc

Extracts a random subset of topics from a dataset

seed: int: Random seed used to compute the fold

sizes: List[float]: Number of topics of each fold (or percentage if sums to 1)

dataset: datamaestro_ir.data.Adhoc: The Adhoc dataset from which a fold is extracted

fold: int: Which fold should be taken

exclude: datamaestro_ir.data.Topics: Exclude some topics from the random fold

assessments: path (generated): Generated assessments file

topics: path (generated): Generated topics file

static folds(seed: int, sizes: List[float], dataset: datamaestro_ir.data.Adhoc, exclude: datamaestro_ir.data.Topics | None = None, submit=True)

Creates folds

Parameters:

submit: if true (default), submits the fold tasks to experimaestro
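As a rough illustration of how `seed`, `sizes`, `fold` and `exclude` could interact, the sketch below splits a list of topic IDs in plain Python. It does not use xpmir: `split_into_folds` is a hypothetical helper, and reading `sizes` as fractions when they sum to 1 follows the parameter description above, not the library's actual code.

```python
import random
from typing import List, Sequence


def split_into_folds(
    topic_ids: Sequence[str],
    sizes: List[float],
    seed: int,
    exclude: Sequence[str] = (),
) -> List[List[str]]:
    """Shuffle topic IDs with a fixed seed and cut them into consecutive folds."""
    excluded = set(exclude)
    # Sort first so the result depends only on the seed, not on input order
    candidates = sorted(t for t in topic_ids if t not in excluded)
    random.Random(seed).shuffle(candidates)

    # Assumption: sizes summing to 1 are fractions, otherwise absolute counts
    if abs(sum(sizes) - 1.0) < 1e-6:
        counts = [int(round(s * len(candidates))) for s in sizes]
    else:
        counts = [int(s) for s in sizes]

    folds, start = [], 0
    for count in counts:
        folds.append(candidates[start:start + count])
        start += count
    return folds


topic_ids = [f"q{i}" for i in range(10)]
folds = split_into_folds(topic_ids, sizes=[0.8, 0.2], seed=0, exclude=["q3"])
train_ids, validation_ids = folds
```

Under these assumptions, choosing `fold=0` would correspond to taking `folds[0]`, and reusing the same seed always yields the same partition.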

XPM Task xpmir.datasets.adapters.ConcatFold(*, datasets)

Bases: Task

Submit type: datamaestro_ir.data.Adhoc

Concatenation of several datasets to get a full dataset.

datasets: List[datamaestro_ir.data.Adhoc]: The list of Adhoc datasets to concatenate

assessments: path (generated): Generated assessments file

topics: path (generated): Generated topics file
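Conceptually, concatenating folds merges their topics and their assessments. The sketch below shows this on plain dictionaries. The `Fold` type and `concat_folds` function are hypothetical stand-ins for illustration, and rejecting duplicate topic IDs is an assumption, not documented behaviour.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for an ad-hoc dataset:
# (topics: topic ID -> text, assessments: topic ID -> {doc ID: relevance})
Fold = Tuple[Dict[str, str], Dict[str, Dict[str, int]]]


def concat_folds(folds: List[Fold]) -> Fold:
    """Merge the topics and assessments of several folds."""
    topics: Dict[str, str] = {}
    assessments: Dict[str, Dict[str, int]] = {}
    for fold_topics, fold_assessments in folds:
        for topic_id, text in fold_topics.items():
            # Assumption: folds are disjoint, so a repeated ID is an error
            if topic_id in topics:
                raise ValueError(f"duplicate topic {topic_id}")
            topics[topic_id] = text
        for topic_id, judged in fold_assessments.items():
            assessments.setdefault(topic_id, {}).update(judged)
    return topics, assessments


train = ({"q1": "neural ranking"}, {"q1": {"d1": 1}})
validation = ({"q2": "query expansion"}, {"q2": {"d2": 0}})
topics, assessments = concat_folds([train, validation])
```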

Documents 

Create document subsets, e.g. restricting a collection to documents returned by a first-stage retriever.

XPM Task xpmir.datasets.adapters.RetrieverBasedCollection(*, relevance_threshold, dataset, retrievers, keepRelevant, keepNotRelevant)

Bases: Task

Submit type: datamaestro_ir.data.Adhoc.XPMConfig

Builds a subset of documents from the output of a set of retrievers and from the relevance assessments. Documents selected from the assessments are collected first, then the retrieved documents are added.

relevance_threshold: float = 0: Relevance threshold

dataset: datamaestro_ir.data.Adhoc: A dataset

retrievers: List[xpmir.rankers.retriever.Retriever]: Rankers

keepRelevant: bool = True: Keep documents judged relevant

keepNotRelevant: bool = False: Keep documents judged not relevant

docids_path: path (generated): The file containing the document identifiers of the collection
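The two-step selection described above (judged documents first, then retrieved ones) can be sketched in plain Python as follows. `select_docids` is a hypothetical helper, and treating a document as relevant when its judgment is strictly above `relevance_threshold` is an assumption: the source does not say whether the comparison is strict.

```python
from typing import Dict, Iterable, List, Set


def select_docids(
    assessments: Dict[str, Dict[str, float]],
    retrieved: Iterable[List[str]],
    relevance_threshold: float = 0,
    keep_relevant: bool = True,
    keep_not_relevant: bool = False,
) -> Set[str]:
    """Collect judged documents (per the keep flags), then add retrieved ones."""
    docids: Set[str] = set()
    # Step 1: judged documents, filtered by the two keep flags
    for judged in assessments.values():
        for docid, rel in judged.items():
            # Assumption: "relevant" means strictly above the threshold
            is_relevant = rel > relevance_threshold
            if (is_relevant and keep_relevant) or (
                not is_relevant and keep_not_relevant
            ):
                docids.add(docid)
    # Step 2: every document returned by any retriever
    for ranking in retrieved:
        docids.update(ranking)
    return docids


qrels = {"q1": {"d1": 1, "d2": 0}}
runs = [["d3"], ["d4"]]  # one ranked list per (retriever, topic) pair
default_ids = select_docids(qrels, runs)
all_judged_ids = select_docids(qrels, runs, keep_not_relevant=True)
```

With the default flags, the resulting collection holds the relevant judged documents plus everything the retrievers returned, which is the usual setup for re-ranking experiments.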

XPM Config xpmir.datasets.adapters.DocumentSubset(*, id, count, base, docids_path, in_memory)

Bases: Documents

ID-based document selection

id: str: The unique (sub-)dataset ID

count: int: Number of documents

base: datamaestro_ir.data.DocumentStore: The full document store

docids_path: path: Path to the file containing the document IDs

in_memory: bool = False: Whether to load the dataset in memory
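To make the role of `base` and `docids_path` concrete, here is a minimal sketch of a subset view over a document store. It is not the xpmir implementation: `DocumentSubsetView` is a hypothetical class, the base store is a plain dictionary, and the one-ID-per-line file layout is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator, Tuple


class DocumentSubsetView:
    """Restricts a base store (here a plain dict) to the IDs listed in a file."""

    def __init__(self, base: Dict[str, str], docids_path: Path):
        # Assumption: the file holds one document ID per line
        with docids_path.open("rt") as fp:
            self.docids = [line.strip() for line in fp if line.strip()]
        self.base = base

    def __len__(self) -> int:
        return len(self.docids)

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        # Documents are looked up lazily in the full store
        for docid in self.docids:
            yield docid, self.base[docid]


base = {"d1": "first", "d2": "second", "d3": "third"}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docids.txt"
    path.write_text("d1\nd3\n")
    subset = DocumentSubsetView(base, path)
    subset_docs = list(subset)
```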

Assessments 

Fold relevance assessments (qrels) by topic ID or topic object.

XPM Config xpmir.datasets.adapters.AbstractAdhocAssessmentFold(*, id, qrels)

Bases: AdhocAssessments

Filter assessments by topic ID

id: str: The unique (sub-)dataset ID

qrels: datamaestro_ir.data.AdhocAssessments: The collection of the assessments

XPM Config xpmir.datasets.adapters.AdhocAssessmentFold(*, id, qrels, ids)

Bases: AbstractAdhocAssessmentFold

Filter assessments by topic ID

id: str: The unique (sub-)dataset ID

qrels: datamaestro_ir.data.AdhocAssessments: The collection of the assessments

ids: List[str]: A set of the ids for the assessments where we select from

XPM Config xpmir.datasets.adapters.IDAdhocAssessmentFold(*, id, qrels, id_list)

Bases: AbstractAdhocAssessmentFold

id: str: The unique (sub-)dataset ID

qrels: datamaestro_ir.data.AdhocAssessments: The collection of the assessments

id_list: xpmir.misc.IDList

Topics 

Fold or generate topic sets.

XPM Config xpmir.datasets.adapters.AbstractTopicFold(*, id, topics)

Bases: Topics

ID-based topic selection

id: str: The unique (sub-)dataset ID

topics: datamaestro_ir.data.Topics: The collection of the topics

XPM Config xpmir.datasets.adapters.TopicFold(*, id, topics, ids)

Bases: AbstractTopicFold

id: str: The unique (sub-)dataset ID

topics: datamaestro_ir.data.Topics: The collection of the topics

ids: List[str]: A set of the ids for the topics where we select from

XPM Config xpmir.datasets.adapters.IDTopicFold(*, id, topics, id_list)

Bases: AbstractTopicFold

id: str: The unique (sub-)dataset ID

topics: datamaestro_ir.data.Topics: The collection of the topics

id_list: xpmir.misc.IDList

XPM Task xpmir.datasets.adapters.TopicsFoldGenerator(*, seed, sizes, dataset, fold, exclude)

Bases: FileIDList, Task

Submit type: datamaestro_ir.data.Adhoc

Extracts a random subset of topics from a dataset

This task is more generic than RandomFold and should work with any kind of topics and assessments, provided they can be serialized with pickle.

path: path (generated): Selected topic IDs

seed: int: Random seed used to compute the fold

sizes: List[float]: Number of topics of each fold (or percentage if sums to 1)

dataset: datamaestro_ir.data.Adhoc: The Adhoc dataset from which a fold is extracted

fold: int: Which fold should be taken

exclude: datamaestro_ir.data.Topics: Exclude some topics from the random fold

XPM Config xpmir.datasets.adapters.MemoryTopicStore(*, topics)

Bases: TextStore

View a set of topics as an in-memory text store

topics: datamaestro_ir.data.Topics: The collection of the topics to build the store

XPM Config xpmir.datasets.adapters.TextStore

Bases: Config

Associates an ID with a text
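The idea of a text store built from topics, that is, a lookup from ID to text held in memory, can be sketched as below. `InMemoryTextStore` is a hypothetical class written for illustration. The real `TextStore` interface is not shown in this page, so the method names here are assumptions.

```python
from typing import Dict, Iterable, Tuple


class InMemoryTextStore:
    """Maps topic IDs to their text, loading everything up front."""

    def __init__(self, topics: Iterable[Tuple[str, str]]):
        # Assumption: each topic is available as an (ID, text) pair
        self._texts: Dict[str, str] = dict(topics)

    def __getitem__(self, key: str) -> str:
        return self._texts[key]

    def __len__(self) -> int:
        return len(self._texts)


store = InMemoryTextStore([("q1", "neural ranking"), ("q2", "query expansion")])
text = store["q2"]
```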
