Packages
The pipeline sketched previously requires two packages to be built:
- A conda package that contains a set of Python libraries (mainly Pyserini) to run BM25 search;
- The NIST trec_eval package to evaluate the retrieval results.
We will define these packages in HyperMake and let them be part of the whole pipeline, so when a user runs the pipeline, the packages will be built and installed automatically.
A package in HyperMake is defined with the package keyword, and it is a special kind of task.
Creating a Conda package
package pyserini -> out:
  mkdir -p $out
  conda create -y \
    -p $out \
    -c conda-forge \
    python=3.10 openjdk=21
  $out/bin/pip install torch faiss-cpu pyserini
We declared a package named pyserini that, when built, creates a new Conda environment with Python 3.10 and OpenJDK 21 and installs Pyserini in it. Note that we build the package in a separate, HyperMake-managed directory $out (using Conda's -p/--prefix option) instead of a global Conda environment.
Creating a Conda environment inside a package directory is so common that HyperMake provides a standard library subroutine to make it easier:
import conda
package pyserini = conda.create(
  packages="python=3.10 openjdk=21",
  extra_args="-c conda-forge",
  extra_pip_packages="torch faiss-cpu pyserini"
)
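Because the environment lives in a directory rather than under a global name, its tools can be called by path without activating anything. The following is a minimal shell sketch of that idea, not part of HyperMake itself: env_bin is a hypothetical helper introduced here for illustration, and it assumes the usual Conda layout in which executables sit under bin/ in the prefix directory.

```shell
#!/bin/sh
# Hypothetical helper (not provided by HyperMake or Conda):
# resolve an executable inside a prefix-style environment directory.
# Usage: env_bin <prefix> <name>
# Prints the full path on success; prints an error and returns 1 otherwise.
env_bin() {
  prefix=$1
  name=$2
  if [ -x "$prefix/bin/$name" ]; then
    printf '%s\n' "$prefix/bin/$name"
  else
    echo "error: $name not found in $prefix/bin" >&2
    return 1
  fi
}

# Example, assuming the package directory is available as $pyserini:
#   "$(env_bin "$pyserini" python)" -c 'import pyserini'
```

Calling the interpreter by path keeps the script independent of whichever Conda environment happens to be active in the calling shell.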
Building the trec_eval package
trec_eval is a C package built with Make. We can define a package for it as well:
package trec_eval -> out:
  git clone https://github.com/usnistgov/trec_eval.git $out
  cd $out
  make
In HyperMake, a package must have exactly one output: the built package directory. To refer to that directory, use the package name directly as a variable (e.g. $pyserini and $trec_eval here).
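To make the role of such a variable concrete, here is a hedged shell sketch of what a script might do once it has the trec_eval package directory in hand. The function name score_run and the file arguments are hypothetical, and the sketch assumes that make leaves an executable named trec_eval at the top of the checkout. How HyperMake passes the variable into a task is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: score a run file with the trec_eval binary
# found in a package directory.
# Usage: score_run <trec_eval_dir> <qrels_file> <run_file>
score_run() {
  pkg_dir=$1
  qrels=$2
  run=$3
  # Fail early if the package directory does not contain the binary.
  if [ ! -x "$pkg_dir/trec_eval" ]; then
    echo "error: no trec_eval executable in $pkg_dir" >&2
    return 1
  fi
  # -m selects the measure to report; ndcg_cut.10 is nDCG at rank 10.
  "$pkg_dir/trec_eval" -m ndcg_cut.10 "$qrels" "$run"
}

# Example, assuming the package directory is available as $trec_eval:
#   score_run "$trec_eval" qrels.txt run.txt
```

The early check gives a clear error message if the package directory is passed incorrectly or the build did not produce the binary.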
What exactly is the difference between a package and a task? When a HyperMake pipeline spans multiple file systems (e.g. local, AWS, SSH), a task is run only once and its outputs are transferred between file systems, while a package is built separately on each file system.