Dataset adapters
Adapters can be used when a collection is derived from another one by subsampling documents and/or queries.
Adhoc datasets
- XPM Task xpmir.datasets.adapters.RandomFold(*, seed, sizes, dataset, fold, exclude)
Bases: Task
Submit type: datamaestro_text.data.ir.Adhoc
Extracts a random subset of topics from a dataset
- seed: int
Random seed used to compute the fold
- sizes: List[float]
Number of topics in each fold (or fraction of the topics, if the sizes sum to 1)
- dataset: datamaestro_text.data.ir.Adhoc
The Adhoc dataset from which a fold is extracted
- fold: int
Which fold should be taken
- exclude: datamaestro_text.data.ir.Topics
Exclude some topics from the random fold
- assessments: Path (generated)
Generated assessments file
- topics: Path (generated)
Generated topics file
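The way seed, sizes, fold and exclude interact can be sketched in plain Python. This is a minimal illustration, not the xpmir implementation: the helper `random_fold` is hypothetical, and both the rule turning fractional sizes into counts and the use of consecutive slices are assumptions.

```python
import random
from typing import Iterable, List, Sequence


def random_fold(
    topic_ids: Sequence[str],
    seed: int,
    sizes: List[float],
    fold: int,
    exclude: Iterable[str] = (),
) -> List[str]:
    """Return the topic IDs of fold number `fold` (hypothetical helper)."""
    excluded = set(exclude)
    # Drop excluded topics, then sort so the result does not depend on input order
    candidates = sorted(t for t in topic_ids if t not in excluded)

    # Assumption: sizes summing to 1 are fractions of the candidate topics
    if abs(sum(sizes) - 1.0) < 1e-6:
        counts = [round(s * len(candidates)) for s in sizes]
    else:
        counts = [int(s) for s in sizes]

    # A seeded shuffle makes the split reproducible
    random.Random(seed).shuffle(candidates)

    # Assumption: fold k is the k-th consecutive slice of the shuffled list
    start = sum(counts[:fold])
    return candidates[start : start + counts[fold]]


topics = [f"q{i}" for i in range(10)]
train = random_fold(topics, seed=0, sizes=[0.8, 0.2], fold=0)
valid = random_fold(topics, seed=0, sizes=[0.8, 0.2], fold=1)
```

Because both calls share the same seed and sizes, the two folds are disjoint and together cover all ten topics.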
- XPM Task xpmir.datasets.adapters.ConcatFold(*, datasets)
Bases: Task
Submit type: datamaestro_text.data.ir.Adhoc
Concatenation of several datasets to get a full dataset.
- datasets: List[datamaestro_text.data.ir.Adhoc]
The list of Adhoc datasets to concatenate
- assessments: Path (generated)
Generated assessments file
- topics: Path (generated)
Generated topics file
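Concatenation amounts to merging the topics and the assessments of each dataset in turn. The sketch below uses a hypothetical `ToyAdhoc` stand-in rather than the real Adhoc class, and assumes that a topic ID already seen in an earlier dataset is skipped; the library may handle duplicates differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyAdhoc:
    """Hypothetical stand-in for an Adhoc dataset: topics plus assessments."""

    # topic ID -> query text
    topics: Dict[str, str]
    # topic ID -> {document ID: relevance}
    assessments: Dict[str, Dict[str, int]] = field(default_factory=dict)


def concat(datasets: List[ToyAdhoc]) -> ToyAdhoc:
    merged = ToyAdhoc(topics={})
    for dataset in datasets:
        for topic_id, text in dataset.topics.items():
            # Assumption: a topic ID seen in an earlier dataset is skipped
            if topic_id in merged.topics:
                continue
            merged.topics[topic_id] = text
            merged.assessments[topic_id] = dict(dataset.assessments.get(topic_id, {}))
    return merged


a = ToyAdhoc({"q1": "first"}, {"q1": {"d1": 1}})
b = ToyAdhoc({"q2": "second", "q1": "duplicate"}, {"q2": {"d2": 0}})
full = concat([a, b])
```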
Documents
- XPM Task xpmir.datasets.adapters.RetrieverBasedCollection(*, relevance_threshold, dataset, retrievers, keepRelevant, keepNotRelevant)
Bases: Task
Submit type: datamaestro_text.data.ir.Adhoc
Builds a subset of documents based on the output of a set of retrievers and on relevance assessments. The documents selected from the assessments are gathered first, then the retrieved ones are added.
- relevance_threshold: float = 0
Relevance threshold
- dataset: datamaestro_text.data.ir.Adhoc
A dataset
- retrievers: List[xpmir.rankers.Retriever]
Rankers
- keepRelevant: bool = True
Keep documents judged relevant
- keepNotRelevant: bool = False
Keep documents judged not relevant
- docids_path: Path (generated)
The file containing the document identifiers of the collection
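The two-step selection (assessed documents first, then retrieved ones) can be sketched as follows. The function `select_docids`, the `Retriever` callable type and the dictionary layouts are hypothetical; treating "relevant" as strictly above the threshold is also an assumption rather than confirmed library behaviour.

```python
from typing import Callable, Dict, List, Set

# Hypothetical retriever type: maps a query text to a ranked list of document IDs
Retriever = Callable[[str], List[str]]


def select_docids(
    topics: Dict[str, str],
    assessments: Dict[str, Dict[str, int]],
    retrievers: List[Retriever],
    relevance_threshold: float = 0,
    keep_relevant: bool = True,
    keep_not_relevant: bool = False,
) -> Set[str]:
    docids: Set[str] = set()

    # Step 1: documents selected from the assessments
    for judged in assessments.values():
        for docid, rel in judged.items():
            # Assumption: "relevant" means strictly above the threshold
            relevant = rel > relevance_threshold
            if (relevant and keep_relevant) or (not relevant and keep_not_relevant):
                docids.add(docid)

    # Step 2: documents returned by each retriever for each topic
    for text in topics.values():
        for retriever in retrievers:
            docids.update(retriever(text))

    return docids


def first_stage(text: str) -> List[str]:
    # Toy retriever returning a fixed ranking
    return ["d3", "d1"]


topics = {"q1": "neural ranking"}
assessments = {"q1": {"d1": 2, "d2": 0}}
docids = select_docids(topics, assessments, [first_stage])
```

With the default flags, the judged non-relevant document `d2` is left out unless a retriever returns it.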
- XPM Config xpmir.datasets.adapters.DocumentSubset(*, id, count, base, docids_path, in_memory)
Bases: Documents
Submit type: xpmir.datasets.adapters.DocumentSubset
ID-based document selection
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- base: datamaestro_text.data.ir.DocumentStore
The full document store
- docids_path: Path
Path to the file containing the document IDs
- in_memory: bool = False
Whether to load the dataset in memory
Assessments
- XPM Config xpmir.datasets.adapters.AbstractAdhocAssessmentFold(*, id, qrels)
Bases: AdhocAssessments
Submit type: xpmir.datasets.adapters.AbstractAdhocAssessmentFold
Filter assessments by topic ID
- id: str
The unique (sub-)dataset ID
- qrels: datamaestro_text.data.ir.AdhocAssessments
The collection of the assessments
- XPM Config xpmir.datasets.adapters.AdhocAssessmentFold(*, id, qrels, ids)
Bases: AbstractAdhocAssessmentFold
Submit type: xpmir.datasets.adapters.AdhocAssessmentFold
Filter assessments by topic ID
- id: str
The unique (sub-)dataset ID
- qrels: datamaestro_text.data.ir.AdhocAssessments
The collection of the assessments
- ids: List[str]
The set of topic IDs used to select the assessments
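Filtering assessments by topic ID reduces to keeping the judgments whose topic belongs to the given ID set. A minimal sketch, assuming a hypothetical nested-dictionary layout for the assessments instead of the library's AdhocAssessments class:

```python
from typing import Dict, Iterable

# Hypothetical layout: topic ID -> {document ID: relevance}
Qrels = Dict[str, Dict[str, int]]


def filter_assessments(qrels: Qrels, ids: Iterable[str]) -> Qrels:
    # Convert to a set once so that each membership test is O(1)
    selected = set(ids)
    return {topic_id: judged for topic_id, judged in qrels.items() if topic_id in selected}


qrels = {"q1": {"d1": 1}, "q2": {"d2": 0}, "q3": {"d1": 2}}
fold_qrels = filter_assessments(qrels, ["q1", "q3", "q9"])
```

IDs without any assessment (here `q9`) are simply ignored.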
- XPM Config xpmir.datasets.adapters.IDAdhocAssessmentFold(*, id, qrels, id_list)
Bases: AbstractAdhocAssessmentFold
Submit type: xpmir.datasets.adapters.IDAdhocAssessmentFold
- id: str
The unique (sub-)dataset ID
- qrels: datamaestro_text.data.ir.AdhocAssessments
The collection of the assessments
- id_list: xpmir.misc.IDList
Topics
- XPM Config xpmir.datasets.adapters.AbstractTopicFold(*, id, topics)
Bases: Topics
Submit type: xpmir.datasets.adapters.AbstractTopicFold
ID-based topic selection
- id: str
The unique (sub-)dataset ID
- topics: datamaestro_text.data.ir.Topics
The collection of the topics
- XPM Config xpmir.datasets.adapters.TopicFold(*, id, topics, ids)
Bases: AbstractTopicFold
Submit type: xpmir.datasets.adapters.TopicFold
- id: str
The unique (sub-)dataset ID
- topics: datamaestro_text.data.ir.Topics
The collection of the topics
- ids: List[str]
The set of IDs used to select the topics
- XPM Config xpmir.datasets.adapters.IDTopicFold(*, id, topics, id_list)
Bases: AbstractTopicFold
Submit type: xpmir.datasets.adapters.IDTopicFold
- id: str
The unique (sub-)dataset ID
- topics: datamaestro_text.data.ir.Topics
The collection of the topics
- id_list: xpmir.misc.IDList
- XPM Task xpmir.datasets.adapters.TopicsFoldGenerator(*, seed, sizes, dataset, fold, exclude)
Bases: FileIDList, Task
Submit type: datamaestro_text.data.ir.Adhoc
Extracts a random subset of topics from a dataset
This task is more generic than RandomFold and should work with any topics and assessments, as long as they are serializable (using pickle).
- path: Path (generated)
Selected topic IDs
- seed: int
Random seed used to compute the fold
- sizes: List[float]
Number of topics in each fold (or fraction of the topics, if the sizes sum to 1)
- dataset: datamaestro_text.data.ir.Adhoc
The Adhoc dataset from which a fold is extracted
- fold: int
Which fold should be taken
- exclude: datamaestro_text.data.ir.Topics
Exclude some topics from the random fold
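Since this task produces a file of selected topic IDs, a fold can then be obtained by reading that file and filtering the topics. The sketch below shows this flow under stated assumptions: the helpers `write_ids`, `read_ids` and `topic_fold` are hypothetical, and the one-ID-per-line file format is a guess, not the format documented for the library.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_ids(path: Path, ids: List[str]) -> None:
    # Assumption: one topic ID per line
    path.write_text("\n".join(ids) + "\n")


def read_ids(path: Path) -> List[str]:
    # Skip empty lines such as the one after the trailing newline
    return [line for line in path.read_text().splitlines() if line]


def topic_fold(topics: Dict[str, str], id_path: Path) -> Dict[str, str]:
    # Keep only the topics whose ID is listed in the file
    selected = set(read_ids(id_path))
    return {tid: text for tid, text in topics.items() if tid in selected}


topics = {"q1": "neural ranking", "q2": "query expansion", "q3": "dense retrieval"}
with tempfile.TemporaryDirectory() as tmp:
    id_path = Path(tmp) / "fold.ids"
    write_ids(id_path, ["q1", "q3"])
    fold_topics = topic_fold(topics, id_path)
```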
- XPM Config xpmir.datasets.adapters.MemoryTopicStore(*, topics)
Bases: TextStore
Submit type: xpmir.datasets.adapters.MemoryTopicStore
View a set of topics as an (in-memory) text store
- topics: datamaestro_text.data.ir.Topics
The collection of the topics to build the store
- XPM Config xpmir.datasets.adapters.TextStore
Bases: Config
Submit type: xpmir.datasets.adapters.TextStore
Associates an ID with a text
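The relation between a text store and its in-memory topic variant can be illustrated with a small sketch. The classes `ToyTextStore` and `ToyMemoryTopicStore` are hypothetical stand-ins, and the `__getitem__` lookup interface is an assumption about how such a store could be queried, not the library's API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ToyTextStore(ABC):
    """Hypothetical interface: associates an ID with a text."""

    @abstractmethod
    def __getitem__(self, key: str) -> str:
        ...


class ToyMemoryTopicStore(ToyTextStore):
    """Hypothetical in-memory store built from a set of topics."""

    def __init__(self, topics: Dict[str, str]):
        # Build the ID -> text mapping once and keep it in memory
        self._store = dict(topics)

    def __getitem__(self, key: str) -> str:
        return self._store[key]


store = ToyMemoryTopicStore({"q1": "neural ranking", "q2": "query expansion"})
text = store["q2"]
```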