Welcome to the Experimaestro IR documentation!
experimaestro-IR (XPMIR) is a library for learning (neural) IR models. XPMIR defines a large set of components that can be composed arbitrarily, making it easy to reuse them when building your own experiments. XPMIR is built upon experimaestro, a library for building complex experimental plans while tracking parameters, and for executing them locally or on a cluster.
Install
The base experimaestro-IR package can be installed with pip install xpmir. Additional functionality is available by installing optional dependencies:
pip install xpmir[neural] to install neural-IR packages
pip install xpmir[anserini] to install Anserini-related packages
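To confirm from Python that the base package is installed, a minimal sketch using only the standard library is given below. The helper name installed_version is hypothetical and not part of XPMIR; the only assumption is that the distribution is named xpmir, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it is not provided by XPMIR.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "xpmir" is the distribution name used in the pip command above
    found = installed_version("xpmir")
    print(f"xpmir: {found or 'not installed'}")
```

The function returns None instead of raising, so it can be called before deciding whether to run pip.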
Example
Below is an example of a simple experiment that runs BM25 on TREC-1 and evaluates the run. The dataset must first be prepared using:
datamaestro datafolders set gov.nist.trec.tipster TIPSTER_PATH
datamaestro prepare gov.nist.trec.adhoc.1
where TIPSTER_PATH is the path containing the TIPSTER collection (i.e. the folders Disk1, Disk2, etc.).
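Before registering the folder, it can help to check that the path really contains the expected disk folders. The sketch below is an assumption-laden illustration: missing_tipster_disks is a hypothetical helper (not part of datamaestro), and it only checks for Disk1 and Disk2, the two folder names mentioned above.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_tipster_disks(
    tipster_path: Path, expected: Iterable[str] = ("Disk1", "Disk2")
) -> List[str]:
    """Return the expected sub-folders that are absent from tipster_path.

    Hypothetical helper; the default names come from the text above and
    may need to be extended for a full TIPSTER collection.
    """
    return [name for name in expected if not (tipster_path / name).is_dir()]


if __name__ == "__main__":
    # Demonstration on a temporary folder that contains only Disk1
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Disk1").mkdir()
        print(missing_tipster_disks(root))  # prints ['Disk2']
```

An empty list means that every expected folder was found.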
You can then execute the following file:
import click
from pathlib import Path
import os
from datamaestro import prepare_dataset
import logging
import multiprocessing

logging.basicConfig(level=logging.INFO)
CPU_COUNT = multiprocessing.cpu_count()

from experimaestro import experiment
from xpmir.evaluation import Evaluate
from xpmir.rankers.standard import BM25
from xpmir.interfaces.anserini import AnseriniRetriever, IndexCollection


# --- Defines the experiment


@click.option("--debug", is_flag=True, help="Print debug information")
@click.option("--port", type=int, default=12345, help="Port for monitoring")
@click.option("--dataset", default="gov.nist.trec.adhoc.1")
@click.argument("workdir", type=Path)
@click.command()
def cli(port, workdir, dataset, debug):
    """Runs an experiment"""
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)

    bm25 = BM25()

    # Sets the working directory and the name of the xp
    with experiment(workdir, "bm25", port=port) as xp:
        # Index the collection
        xp.setenv("JAVA_HOME", os.environ["JAVA_HOME"])
        ds = prepare_dataset(dataset)

        documents = ds.documents
        index = IndexCollection(
            documents=documents,
            storePositions=True,
            storeDocvectors=True,
            storeContents=True,
            threads=CPU_COUNT,
        ).submit()

        # Search with BM25
        bm25_retriever = AnseriniRetriever(k=1500, index=index, model=bm25).tag(
            "model", "bm25"
        )

        bm25_eval = Evaluate(dataset=ds, retriever=bm25_retriever).submit()

    print("BM25 results on TREC 1")
    print(bm25_eval.results.read_text())


if __name__ == "__main__":
    cli()
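The script above reads os.environ["JAVA_HOME"] directly, which raises a bare KeyError when the variable is not set. A minimal sketch of a friendlier check follows; require_env is a hypothetical helper, not part of experimaestro or XPMIR, and the error message is only a suggestion.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper; `environ` can be replaced by a plain dict for testing.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} must be set before running the experiment")
    return value


if __name__ == "__main__":
    # Use an explicit mapping so the demonstration does not depend on the machine
    print(require_env("JAVA_HOME", {"JAVA_HOME": "/opt/java"}))  # prints /opt/java
```

Under these assumptions, the lookup in the script could be written as xp.setenv("JAVA_HOME", require_env("JAVA_HOME")), so that a missing variable is reported before any task is submitted.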