Welcome to the experimaestro-IR documentation!

experimaestro-IR (XPMIR) is a library for building and evaluating Information Retrieval models, with a focus on neural approaches. XPMIR defines a large set of composable components – scorers, retrievers, text encoders, samplers, and evaluation pipelines – that can be combined to build reproducible experiments.

XPMIR is built upon experimaestro, a framework that tracks parameters, manages dependencies, and executes experimental plans locally or on a cluster.

Install

The base experimaestro-IR package can be installed with pip install xpmir. Optional dependencies unlock additional functionality:

  • pip install xpmir[neural] – PyTorch, Transformers, and Sentence Transformers for neural IR models

  • pip install xpmir[anserini] – Anserini/Pyserini for classical IR models
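To see which optional extras are usable in the current environment, one can probe for their top-level modules. The sketch below assumes the extras provide the modules torch, transformers, sentence_transformers, and pyserini; these names are inferred from the descriptions above, not taken from the XPMIR packaging metadata, and available_extras is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed module names for each extra (inferred, not confirmed by XPMIR)
EXTRA_MODULES = {
    "neural": ["torch", "transformers", "sentence_transformers"],
    "anserini": ["pyserini"],
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Map each extra to True if all of its modules can be located."""
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"xpmir[{extra}]: {'available' if ok else 'missing'}")
```

The check only locates the modules without importing them, so it stays fast even when PyTorch is installed.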

Example

Below is a minimal experiment that indexes a collection, runs BM25, and evaluates the results on TREC-1. First, prepare the dataset:

datamaestro datafolders set gov.nist.trec.tipster TIPSTER_PATH
datamaestro prepare gov.nist.trec.adhoc.1

where TIPSTER_PATH is the path containing the TIPSTER collection (i.e. the folders Disk1, Disk2, etc.).
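Before running datamaestro prepare, it can help to confirm that TIPSTER_PATH has the expected layout. The following is a minimal sketch: missing_disks is a hypothetical helper, and the default folder list only covers the two names mentioned above, so adjust it to the disks the collection actually contains.

```python
from pathlib import Path


def missing_disks(tipster_path, expected=("Disk1", "Disk2")):
    """Return the expected sub-folders that are absent from tipster_path."""
    root = Path(tipster_path)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    import sys

    # Usage: python check_tipster.py TIPSTER_PATH
    absent = missing_disks(sys.argv[1])
    if absent:
        print("Missing folders:", ", ".join(absent))
    else:
        print("TIPSTER layout looks complete")
```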

Then execute the following file:

import click
from pathlib import Path
import os
from datamaestro import prepare_dataset
import logging

from experimaestro import experiment
from xpmir.evaluation import Evaluate
from xpmir.rankers.standard import BM25
from xpmir.interfaces.anserini import AnseriniRetriever, IndexCollection


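logging.basicConfig(level=logging.INFO)
# Number of CPUs usable by this process; os.sched_getaffinity is not
# available on every platform, so fall back to os.cpu_count()
if hasattr(os, "sched_getaffinity"):
    CPU_COUNT = len(os.sched_getaffinity(0))
else:
    CPU_COUNT = os.cpu_count() or 1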

# --- Defines the experiment


@click.option("--debug", is_flag=True, help="Print debug information")
@click.option("--port", type=int, default=12345, help="Port for monitoring")
@click.option("--dataset", default="gov.nist.trec.adhoc.1")
@click.argument("workdir", type=Path)
@click.command()
def cli(port, workdir, dataset, debug):
    """Runs an experiment"""
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)

    bm25 = BM25()

    # Sets the working directory and the name of the xp
    with experiment(workdir, "bm25", port=port) as xp:
        # Pass JAVA_HOME on to the submitted tasks (Anserini runs on the JVM)
        xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])
        ds = prepare_dataset(dataset)

        # Index the collection
        documents = ds.documents
        index = IndexCollection(
            documents=documents,
            storePositions=True,
            storeDocvectors=True,
            storeContents=True,
            threads=CPU_COUNT,
        ).submit()

        # Search with BM25
        bm25_retriever = AnseriniRetriever(k=1500, index=index, model=bm25).tag(
            "model", "bm25"
        )

        # Evaluate the BM25 run on the dataset
        bm25_eval = Evaluate(dataset=ds, retriever=bm25_retriever).submit()

    logging.info("BM25 results on TREC 1")
    logging.info(bm25_eval.results.read_text())


if __name__ == "__main__":
    cli()
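The script above reads os.environ["JAVA_HOME"] directly, so it fails with a bare KeyError when the variable is unset. A small pre-flight check can give a clearer message; this is only a sketch, and check_java_home is a hypothetical helper that is not part of XPMIR or experimaestro.

```python
import os
from pathlib import Path


def check_java_home(environ=os.environ):
    """Return JAVA_HOME as a Path, or raise a descriptive error."""
    value = environ.get("JAVA_HOME")
    if not value:
        raise RuntimeError("JAVA_HOME is not set; it is needed for the Anserini steps")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"JAVA_HOME points to a missing directory: {path}")
    return path
```

Calling check_java_home() at the start of cli(), before the experiment context is opened, would surface the problem before any task is submitted.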
