Task composition

In the previous section, we saw how to define a task with parameters. In this section, we will see how to compose tasks together to form a pipeline.

To run the BEIR benchmark with the Pyserini package, we envision the following pipeline:

graph LR;
raw_beir_data --> beir_to_trec;
beir_to_trec --> index;
index --> retrieve;
beir_to_trec --> retrieve;
retrieve --> evaluate;
beir_to_trec --> evaluate;

Essentially, our pipeline consists of the following steps:

  • Download the raw data in raw_beir_data;
  • Preprocess the data to the standard TREC format in beir_to_trec;
  • Index the data in index to create a BM25 index (depending on the preprocessed data);
  • Retrieve the top-100 documents for each query in retrieve (depending on the index and the data);
  • Evaluate the retrieval results in evaluate (depending on the data and the retrieved results).
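
To make the ordering implied by these dependencies concrete, here is a minimal Python sketch. It is independent of the workflow tool itself and only illustrates the idea. The dependencies dictionary is transcribed by hand from the edges of the graph above, and the standard-library graphlib module (Python 3.9+) computes an order in which every task runs after the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it consumes.
# Hand-transcribed from the graph above; not read from any pipeline file.
dependencies = {
    "raw_beir_data": set(),
    "beir_to_trec": {"raw_beir_data"},
    "index": {"beir_to_trec"},
    "retrieve": {"beir_to_trec", "index"},
    "evaluate": {"beir_to_trec", "retrieve"},
}

# static_order() yields each task only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(execution_order))
```

Because each task here depends, directly or indirectly, on every task listed before it, this particular graph admits exactly one valid order: the order of the list above.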

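To compose these tasks, we need to define the dependencies between them. We do this by passing the output of one task as an input parameter of another: for example, data=$beir_to_trec.out binds the parameter data to the output out of the task beir_to_trec.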

task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
  ...
  
task beir_to_trec(data=$raw_beir_data.out) -> out:
  ...

task index(data=$beir_to_trec.out) -> out:
  ...

task retrieve(data=$beir_to_trec.out, index=$index.out) -> out:
  ...

task evaluate(data=$beir_to_trec.out, result=$retrieve.out) -> out:
  ...
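
As a sanity check, the edges of the pipeline graph can be recovered from the task headers alone. The Python sketch below is an illustration only and does not reflect how the workflow tool parses its files. The headers string is a hand-copied version of the declarations above, with evaluate reading its data from beir_to_trec as the graph shows, and dependency_edges is a hypothetical helper that treats each $task.output reference as one edge.

```python
import re

# Task headers copied by hand from the declarations above (bodies omitted).
headers = """
task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
task beir_to_trec(data=$raw_beir_data.out) -> out:
task index(data=$beir_to_trec.out) -> out:
task retrieve(data=$beir_to_trec.out, index=$index.out) -> out:
task evaluate(data=$beir_to_trec.out, result=$retrieve.out) -> out:
"""


def dependency_edges(text):
    """Return the (upstream, downstream) pairs implied by $task.output references."""
    edges = set()
    for line in text.strip().splitlines():
        match = re.match(r"task (\w+)\((.*)\) -> \w+:", line)
        if match is None:
            continue
        name, params = match.groups()
        # A bare "$" (an unbound parameter) has no task name after it,
        # so it is not matched and contributes no edge.
        for upstream in re.findall(r"\$(\w+)\.\w+", params):
            edges.add((upstream, name))
    return edges


edges = dependency_edges(headers)
for upstream, downstream in sorted(edges):
    print(f"{upstream} --> {downstream}")
```

The six pairs it prints are the same six edges as in the graph at the start of this section.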

In the next sections we will implement these tasks one by one.