Welcome to the Experimaestro IR documentation!

experimaestro-IR (XPMIR) is a library for learning (neural) IR models. XPMIR defines a large set of components that can be composed arbitrarily, which makes it easy to reuse them when building your own experiments. XPMIR is built upon experimaestro, a library for building complex experimental plans while tracking parameters, and for executing them locally or on a cluster.

Install

The base experimaestro-IR package can be installed with pip install xpmir. Additional functionality is available by installing optional dependencies:

  • pip install xpmir[neural] to install neural-IR packages

  • pip install xpmir[anserini] to install Anserini-related packages

Example

Below is an example of a simple experiment that runs BM25 and evaluates the run on TREC-1. The dataset must first be prepared using:

datamaestro datafolders set gov.nist.trec.tipster TIPSTER_PATH
datamaestro prepare gov.nist.trec.adhoc.1

where TIPSTER_PATH is the path containing the TIPSTER collection (i.e. the folders Disk1, Disk2, etc.).
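
Before running the datamaestro commands, it can help to confirm that TIPSTER_PATH really contains the disk folders. The following is a minimal sketch using only the Python standard library. The helper name find_tipster_disks is hypothetical (it is not part of datamaestro or XPMIR), and it assumes the folders are named Disk1, Disk2, etc. directly under TIPSTER_PATH, as described above.

```python
import re
from pathlib import Path


def find_tipster_disks(tipster_path):
    """Return the names of the DiskN folders directly under tipster_path.

    Hypothetical helper, not part of datamaestro or XPMIR. It assumes the
    collection folders are named Disk1, Disk2, ... and sorts them by number.
    """
    root = Path(tipster_path)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")

    pattern = re.compile(r"^Disk(\d+)$")
    disks = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        # Keep only directories whose name is exactly "Disk" + a number
        if match and child.is_dir():
            disks.append((int(match.group(1)), child.name))

    # Sort numerically so that Disk10 comes after Disk2
    return [name for _, name in sorted(disks)]
```

An empty list suggests that the path points at the wrong folder, for instance at a parent directory or at one of the disks itself.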

You can then execute the following file:

import click
from pathlib import Path
import os
from datamaestro import prepare_dataset
import logging

from experimaestro import experiment
from xpmir.evaluation import Evaluate
from xpmir.rankers.standard import BM25
from xpmir.interfaces.anserini import AnseriniRetriever, IndexCollection


logging.basicConfig(level=logging.INFO)
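# os.sched_getaffinity is only available on some Unix platforms (e.g. Linux);
# fall back to os.cpu_count() elsewhere
if hasattr(os, "sched_getaffinity"):
    CPU_COUNT = len(os.sched_getaffinity(0))
else:
    CPU_COUNT = os.cpu_count() or 1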

# --- Defines the experiment


@click.option("--debug", is_flag=True, help="Print debug information")
@click.option("--port", type=int, default=12345, help="Port for monitoring")
@click.option("--dataset", default="gov.nist.trec.adhoc.1")
@click.argument("workdir", type=Path)
@click.command()
def cli(port, workdir, dataset, debug):
    """Runs an experiment"""
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)

    bm25 = BM25()

    # Sets the working directory and the name of the xp
    with experiment(workdir, "bm25", port=port) as xp:
        # Index the collection
        xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])
        ds = prepare_dataset(dataset)

        documents = ds.documents
        index = IndexCollection(
            documents=documents,
            storePositions=True,
            storeDocvectors=True,
            storeContents=True,
            threads=CPU_COUNT,
        ).submit()

        # Search with BM25
        bm25_retriever = AnseriniRetriever(k=1500, index=index, model=bm25).tag(
            "model", "bm25"
        )

        bm25_eval = Evaluate(dataset=ds, retriever=bm25_retriever).submit()

    logging.info("BM25 results on TREC 1")
    logging.info(bm25_eval.results.read_text())


if __name__ == "__main__":
    cli()
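
The script above reads JAVA_HOME with os.environ["JAVA_HOME"], which raises a bare KeyError when the variable is not set. A more explicit pre-flight check could look like the sketch below. The helper name require_java_home is hypothetical and not part of XPMIR or experimaestro; it uses only the standard library and checks only that the variable is set and points to a directory.

```python
import os
from pathlib import Path


def require_java_home(environ=None):
    """Return JAVA_HOME as a Path, or raise a descriptive RuntimeError.

    Hypothetical helper, not part of XPMIR or experimaestro.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    value = env.get("JAVA_HOME")
    if not value:
        raise RuntimeError("JAVA_HOME is not set; it is needed for Anserini indexing")

    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"JAVA_HOME points to {path}, which is not a directory")
    return path
```

Inside the experiment block, the value could then be passed on with xp.setenv("JAVA_HOME", str(require_java_home())).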
