Welcome to Experimaestro IR documentation!
experimaestro-IR (XPMIR) is a library for building and evaluating Information Retrieval models, with a focus on neural approaches. XPMIR defines a large set of composable components – scorers, retrievers, text encoders, samplers, and evaluation pipelines – that can be combined to build reproducible experiments.
XPMIR is built upon experimaestro, a framework that tracks parameters, manages dependencies, and executes experimental plans locally or on a cluster.
Install
The base experimaestro-IR package can be installed with pip install xpmir.
Optional dependencies unlock additional functionality:
pip install xpmir[neural] – PyTorch, Transformers, and Sentence Transformers for neural IR models
pip install xpmir[anserini] – Anserini/Pyserini for classical IR models
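To see which optional dependencies are already importable in the current environment, a small check like the one below may help. It is only a sketch: the mapping from extras to importable module names (torch, transformers, sentence_transformers, pyserini) is an assumption based on the package names listed above, not something read from the XPMIR package metadata, and the helper name available_extras is hypothetical.

```python
from importlib.util import find_spec

# Assumed mapping from xpmir extras to top-level module names
# (not taken from the xpmir package metadata).
EXTRA_MODULES = {
    "neural": ["torch", "transformers", "sentence_transformers"],
    "anserini": ["pyserini"],
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: bool}, True when every module of the extra can be found."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"xpmir[{extra}]: {'available' if ok else 'missing'}")
```

find_spec only locates the modules without importing them, so the check stays fast even when PyTorch is installed.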
Example
Below is a minimal experiment that indexes a collection, runs BM25, and evaluates the results on TREC-1. First, prepare the dataset:
datamaestro datafolders set gov.nist.trec.tipster TIPSTER_PATH
datamaestro prepare gov.nist.trec.adhoc.1
where TIPSTER_PATH is the path containing the TIPSTER collection (i.e. the
folders Disk1, Disk2, etc.).
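Before registering the folder with datamaestro, it can be worth confirming that the path really contains the expected sub-folders. The following is a minimal sketch; the helper name missing_disks is hypothetical, and it only checks Disk1 and Disk2 by default because those are the only folder names spelled out above.

```python
import sys
from pathlib import Path


def missing_disks(tipster_path, expected=("Disk1", "Disk2")):
    """Return the names in `expected` that are not sub-folders of `tipster_path`."""
    root = Path(tipster_path)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # The path is taken from the command line; defaults to the current folder
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_disks(path)
    print("OK" if not missing else f"Missing folders: {', '.join(missing)}")
```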
Then execute the following file:
import click
from pathlib import Path
import os
from datamaestro import prepare_dataset
import logging
from experimaestro import experiment

from xpmir.evaluation import Evaluate
from xpmir.rankers.standard import BM25
from xpmir.interfaces.anserini import AnseriniRetriever, IndexCollection

logging.basicConfig(level=logging.INFO)
CPU_COUNT = len(os.sched_getaffinity(0))


# --- Defines the experiment


@click.option("--debug", is_flag=True, help="Print debug information")
@click.option("--port", type=int, default=12345, help="Port for monitoring")
@click.option("--dataset", default="gov.nist.trec.adhoc.1")
@click.argument("workdir", type=Path)
@click.command()
def cli(port, workdir, dataset, debug):
    """Runs an experiment"""
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)

    bm25 = BM25()

    # Sets the working directory and the name of the xp
    with experiment(workdir, "bm25", port=port) as xp:
        # Index the collection
        xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])
        ds = prepare_dataset(dataset)

        documents = ds.documents
        index = IndexCollection(
            documents=documents,
            storePositions=True,
            storeDocvectors=True,
            storeContents=True,
            threads=CPU_COUNT,
        ).submit()

        # Search with BM25
        bm25_retriever = AnseriniRetriever(k=1500, index=index, model=bm25).tag(
            "model", "bm25"
        )

        bm25_eval = Evaluate(dataset=ds, retriever=bm25_retriever).submit()

    logging.info("BM25 results on TREC 1")
    logging.info(bm25_eval.results.read_text())


if __name__ == "__main__":
    cli()
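The script computes CPU_COUNT with os.sched_getaffinity, which Python only provides on some Unix platforms (it is absent on macOS and Windows, for instance). A portable variant could look like the sketch below; the helper name usable_cpu_count is hypothetical and not part of XPMIR.

```python
import os


def usable_cpu_count():
    """Number of CPUs this process may use, with a portable fallback."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity masks (e.g. set by a cluster scheduler)
        return len(os.sched_getaffinity(0))
    # os.cpu_count() may return None when the count cannot be determined
    return os.cpu_count() or 1


CPU_COUNT = usable_cpu_count()
```

Preferring the affinity mask when it is available keeps the indexing job from requesting more threads than a cluster scheduler has actually granted to the process.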