HyperMake
HyperMake is a parameterized pipeline definition language (think of a `make` where tasks can be parameterized), heavily inspired by Ducttape.
- Shell scripting: Write tasks in plain Bash, just like `make`. No Python or YAML is needed to define tasks.
- Cached intermediate results: Intermediate results are cached, so if a task fails, the pipeline can be resumed from the last successful task.
- Parameterization of tasks: a task can have multiple versions (e.g. in an ML pipeline, a task can be run with different hyperparameters).
- Minimal juggling: Inputs and outputs are just files or symlinks, and arguments are all passed as environment variables. Each task is realized as a child process.
- Automatic parallelization: tasks are automatically parallelized based on the dependency DAG.
- Cloud-agnostic: tasks can run locally or on a cloud (e.g. AWS, Azure) with minimal changes to the pipeline.
Installation
Via Homebrew
HyperMake can be installed via `brew` with a custom tap.
brew tap ctongfei/repo
brew install --HEAD ctongfei/repo/hypermake
Right now the Homebrew formula is configured to use the `HEAD` version of HyperMake, i.e., the latest commit on the `main` branch. To reinstall after the latest version has changed, run
brew reinstall ctongfei/repo/hypermake
Building from source
HyperMake can also be built directly from source. It requires `sbt` to build and JDK 8+ to run.
git clone https://github.com/ctongfei/hypermake
cd hypermake
make
make install
This will build the HyperMake binary and install it locally to `$HOME/.local/bin/hypermake`.
To install it elsewhere, simply modify the `$PREFIX` variable in the `Makefile`.
In the following sections we will first get a glimpse of HyperMake by running a simple "Hello, world" task.
Then, we will gradually introduce more advanced features of HyperMake by building a pipeline that runs the BEIR benchmark.
BEIR is a robust and heterogeneous evaluation benchmark for zero-shot information retrieval. It includes a diverse set of retrieval tasks, such as web search, question answering, and entity retrieval. The benchmark is designed to evaluate the generalization capabilities of retrieval models across different tasks and domains.
Hello, world!
To introduce HyperMake, let's define our first task:
task hello:
echo "Hello, world!"
Save this file as `hello.hm`.
We have created our first HyperMake script file that contains a single task.
This script defines a task named `hello` that prints "Hello, world!" to the console. There is no input or output for this task.
Note the syntax here: a code block starts after the `:` at the end of the task signature. A code block is a consecutive run of indented script lines, where each line must start with at least 2 spaces. By default, the script is written in Bash.
If you are familiar with `make`, you can think of a task as a `make` rule. The task above, written as a Makefile, would just be
hello:
echo "Hello, world!"
.PHONY: hello # since this task does not produce any output
Now let's run this task!
Execute the following command in your shell:
hypermake hello.hm run hello
We should see the output "Hello, world!" printed in the terminal.
The basic command line usage is `hypermake $script <subtask> $target`. Here the `<subtask>` is simply `run`.
Parameters
For the following tutorial sections, we will gradually build our `beir.hm` pipeline.
In any ML pipeline, one starts with data. In this example, we will use the BEIR-14 subset (those with public licenses) of the BEIR benchmark.
We declare these datasets as a pipeline parameter `BeirDataset`:
beir_dataset = {BeirDataset:
msmarco scifact trec-covid webis-touche2020
fiqa dbpedia-entity fever nfcorpus hotpotqa
climate-fever scidocs nq quora arguana
}
Note the syntax for declaring a parameter: `{ParamName: key0 key1 ...}`. For each parameter, the first key is considered the default case -- here `msmarco`.
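For instance, another (hypothetical) parameter with two settings could be declared as follows; the first key listed, `bm25`, would be its default:
retriever = {Retriever: bm25 dense}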
Now we want to download the raw data from their official location.
beir_url_prefix = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"
We proceed to write the first task of our pipeline: downloading the raw data and unzipping it.
task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
wget -O dataset.zip $beir_url_prefix/$beir_dataset.zip
unzip dataset.zip
rm dataset.zip
mv $beir_dataset out
We declared a task that takes two inputs, `beir_dataset` and `beir_url_prefix`, and produces an output directory `out`.
The syntax `name=$` is a shorthand for `name=$name`. Here, `beir_dataset=$` introduces the `beir_dataset` parameter as an input to the task.
The task is considered complete when its output directory `out` exists after the task exits with a zero status code.
In HyperMake, the success of a task is determined by
- the existence of all of its specified outputs
- AND the zero exit status code of the task script.
Note that `beir_dataset` is a parameterized value: it can take any of the values in the `BeirDataset` parameter; `beir_url_prefix` is a singleton (non-parameterized) value.
Hence the task `raw_beir_data` is parameterized with all the parameters in its inputs. Consider `raw_beir_data` not as a single task, but as a tensor of tasks with one dimension: the `BeirDataset` parameter.
Invoking the task
Let's invoke this task. To download the `msmarco` dataset, we run:
hypermake beir.hm run "raw_beir_data[BeirDataset:msmarco]"
We will find the downloaded dataset in the `out/raw_beir_data/default` directory.
Note the task indexing syntax: `task[Param0: key0, Param1: key1, ...]`.
Download another one:
hypermake beir.hm run "raw_beir_data[BeirDataset:scifact]"
We will find the downloaded dataset in the `out/raw_beir_data/BeirDataset=scifact` directory.
The output directory will be `out/<task-name>/<non-default-params>`, where `<non-default-params>` is the URL percent-encoding of the key-value pairs.
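As a hypothetical illustration (the parameter names are made up), a task `train` realized with the non-default values `Dropout=0.1` and `Lr=0.01` would land under a directory like:
out/train/Dropout=0.1&Lr=0.01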
Clearly we do not wish to invoke the download for each dataset one by one. Let's use the wildcard `*`:
hypermake beir.hm run "raw_beir_data[BeirDataset: *]" -j8
The `-j` flag specifies the number of parallel jobs to run. Here we run 8 jobs in parallel.
At this point we have downloaded all the datasets. We will proceed to the next steps in the following sections.
Task composition
In the previous section, we saw how to define a task with parameters. In this section, we will see how to compose tasks together to form a pipeline.
To run the BEIR benchmark with the Pyserini package, we envision the following pipeline:
raw_beir_data --> beir_to_trec
beir_to_trec --> index
index --> retrieve
beir_to_trec --> retrieve
retrieve --> evaluate
beir_to_trec --> evaluate
Essentially, our pipeline consists of the following steps:
- Download the raw data in `raw_beir_data`;
- Preprocess the data into the standard TREC format in `beir_to_trec`;
- Index the data in `index` to create a BM25 index (depending on the preprocessed data);
- Retrieve the top-100 documents for each query in `retrieve` (depending on the index and the data);
- Evaluate the retrieval results in `evaluate` (depending on the data and the retrieved results).
To compose these tasks, we need to define the dependencies between them. This is done by specifying the output of one task as the input of another task.
task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
...
task beir_to_trec(data=$raw_beir_data.out) -> out:
...
task index(data=$beir_to_trec.out) -> out:
...
task retrieve(data=$beir_to_trec.out, index=$index.out) -> out:
...
task evaluate(data=$beir_to_trec.out, result=$retrieve.out) -> out:
...
In the next sections we will implement these tasks one by one.
Packages
Our previous sketch requires some packages to be built.
- A `conda` package that contains a bunch of Python libraries (mainly Pyserini) to run BM25 search;
- The NIST `trec_eval` package to evaluate the retrieval results.
We will define these packages in HyperMake and let them be part of the whole pipeline, so when a user runs the pipeline, the packages will be built and installed automatically.
A package in HyperMake is defined with the `package` keyword, and it is a special kind of task.
Creating a Conda package
package pyserini -> out:
mkdir -p $out
conda create -y \
-p $out \
-c conda-forge \
python=3.10 openjdk=21
$out/bin/pip install torch faiss-cpu pyserini
We declared a package named `pyserini` that, when built, creates a new Conda environment with Python 3.10 and OpenJDK 21, and installs Pyserini in it. Note that we build the package in a separate, HyperMake-managed directory `$out` (via Conda's `-p`/`--prefix` option) instead of in a global Conda environment.
This pattern is so common that HyperMake provides standard library subroutines to make it easier:
import conda
package pyserini = conda.create(
packages="python=3.10 openjdk=21",
extra_args="-c conda-forge",
extra_pip_packages="torch faiss-cpu pyserini"
)
Building the `trec_eval` package
`trec_eval` is a C package built with Make. We can define a package for it as well:
package trec_eval -> out:
git clone https://github.com/usnistgov/trec_eval.git $out
cd $out
make
In HyperMake, a package must have exactly 1 output: the built package directory. To refer to the output directory, directly use the package name as a variable (e.g. `$pyserini` and `$trec_eval` here).
What exactly is the difference between a package and a task? When a HyperMake pipeline is defined across multiple file systems (e.g. local, AWS, SSH, etc.), a task is run only once and its outputs are transferred between file systems, while a package is built separately on each file system.
Decorators
Now let's convert our downloaded BEIR data to the standard TREC format. This format is standard for information retrieval tasks. There are 3 kinds of files in the TREC format:
- `*.queries`, a TSV file with two columns: query id and query text;
- `*.qrels`, a TSV file with four columns: query id, iteration id, document id, and relevance label;
- `corpus`, a TSV file with two columns: document id and document text.
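As a hypothetical illustration (the ids and text below are made up), the three kinds of files look roughly like this:
# test.queries: query id <TAB> query text
q1	what is the capital of france
# test.qrels: query id, iteration id, document id, relevance label
q1	0	d42	1
# corpus: document id <TAB> document text
d42	Paris is the capital of France.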
This conversion involves some complex processing, so we will first write a Python script `beir_to_trec.py` to do this.
import os
import json
import sys
import csv
from tqdm import tqdm
data = sys.argv[1] # The directory containing the downloaded BEIR data
out = sys.argv[2] # The directory to write the TREC format data
os.mkdir(out)
with open(f"{data}/corpus.jsonl") as f_in, open(f"{out}/corpus", 'w') as f_out:
for line in tqdm(f_in):
obj = json.loads(line)
id = obj['_id']
text = obj['text'].replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
title = obj.get('title', "")
trec_line = f"{id}\t{title}: {text}" if title != "" else f"{id}\t{text}"
# Concatenate title and text
print(trec_line, file=f_out)
queries = {}
with open(f"{data}/queries.jsonl") as f:
for line in tqdm(f):
obj = json.loads(line)
id = obj['_id']
text = obj['text']
queries[id] = text
for partition in os.listdir(f"{data}/qrels"):
partition = os.path.splitext(partition)[0]
with open(f"{data}/qrels/{partition}.tsv") as f_in, open(f"{out}/{partition}.qrels", 'w') as f_out:
query_ids = set()
for row in tqdm(csv.DictReader(f_in, delimiter='\t')):
query_ids.add(row['query-id'])
print(f"{row['query-id']}\t0\t{row['corpus-id']}\t{row['score']}", file=f_out)
with open(f"{out}/{partition}.queries", 'w') as f:
for query_id in query_ids:
print(f"{query_id}\t{queries[query_id]}", file=f)
Now we can write a task to run this script.
task beir_to_trec(data=$raw_beir_data.out) -> out:
python beir_to_trec.py $data out
This task takes the output of the `raw_beir_data` task as input and produces a directory `out` containing the TREC format data.
But to run this task, we would need to activate a Conda environment containing the Python dependencies required by `beir_to_trec.py` before invoking `hypermake` from the command line. This is not ideal -- recall that we just built the `pyserini` Conda environment in the previous section. We would like the task itself to run in the `pyserini` environment.
Let's decorate this task with the `@conda.activate` decorator, which activates the `pyserini` environment.
import conda
@conda.activate(environment=$pyserini)
task beir_to_trec(data=$raw_beir_data.out) -> out:
python beir_to_trec.py $data out
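With the decorator in place, the task can be invoked just like before; for instance, for the `scifact` dataset (this will also build the `pyserini` package first if it has not been built yet):
hypermake beir.hm run "beir_to_trec[BeirDataset:scifact]"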
What is the magic behind this decorator? A HyperMake decorator takes a script and returns a new wrapped script. To implement your own decorator, you need an object with a `run` function. If you are curious, you can find the implementation of the `conda.activate` decorator here.
Next steps
Let's continue building our pipeline, starting with indexing the corpus with Pyserini. In this step, we run a Bash script under the Pyserini Conda environment.
@conda.activate(environment=$pyserini)
task index(data=$beir_to_trec.out) -> out:
mkdir corpus
cat $data/corpus \
  | jq -nRc 'inputs | split("\t") | {id: .[0], contents: .[1]}' \
> corpus/corpus.json # Convert TREC format to Pyserini JSON
python -m pyserini.index.lucene \
--collection JsonCollection \
--input corpus \
--index $out \
--generator DefaultLuceneDocumentGenerator \
--threads $(nproc) \
--storePositions \
--storeDocvectors \
--storeRaw
Run the actual retrieval with Pyserini.
@conda.activate(environment=$pyserini)
task retrieve(
data=$beir_to_trec.out,
test_partition=$,
index=$index.out
) -> (out="result.qres"):
ln -s $data/$test_partition.queries test.tsv
python -m pyserini.search.lucene \
--index $index \
--topics test.tsv \
--output $out \
--batch-size 32 \
--hits 100 \
--threads $(nproc) \
--remove-duplicates --remove-query --bm25
Evaluate the retrieval results with `trec_eval`.
task evaluate(
data=$beir_to_trec.out,
result=$retrieve.out,
test_partition=$,
trec_eval=$
) -> (out="eval.txt"):
$trec_eval/trec_eval -m all_trec $data/$test_partition.qrels $result > $out
Here we referenced the output of the `trec_eval` package as `$trec_eval`. This is because the `trec_eval` package is a separate package that we built in the previous section. We can refer to the output of a package directly by its name.
Reduction
At this point we have a pipeline that retrieves documents from a corpus using Pyserini. We have also evaluated the retrieval results with `trec_eval`. But there are a lot of runs: one for each dataset in BEIR-14. We would like to aggregate the evaluation results.
This is done by reduction in HyperMake: think of it as `reduce` in functional programming, or `.max(dim=i)` in a tensor processing library.
Recall that at the end of the pipeline we have built thus far, there is one `trec_eval` output file `eval.txt` per dataset, available under the variable `$evaluate.out`. We would like to aggregate the results over all datasets in BEIR-14. Additionally, there is more than one metric that we care about: for example, `ndcg_cut_10`, `recall_100`, `map`, and `mrr`.
We can define a new task `aggregate_metric` that takes the evaluation results of all datasets and aggregates them. The task definition is as follows:
metric = {Metric: ndcg_cut_10 recall_100 map mrr}
task aggregate_metric(
eval_results=$evaluate[BeirDataset: *].out,
metric=$
) -> (out="aggregate.txt"):
grep -E "^$metric " $eval_results/* > $out
Note here that we used `$evaluate[BeirDataset: *].out` to refer to the outputs of the `evaluate` task for every dataset in BEIR-14. The parameter `eval_results`, while logically a dictionary from configurations to files, will be realized as a folder of files for the shell script.
HyperMake maps a dictionary to a folder of files in the shell script. This is a common pattern in HyperMake to handle multiple outputs. Given a dictionary `d`:
- `d[key]` would be `$d/$key` in Shell;
- `d.values()` would be `$d/*` in Shell;
- `for key in d` would be `for f in $d/*` in Shell.
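For example, inside the `aggregate_metric` script above, `eval_results` is realized as a folder with one entry per `BeirDataset` key, so iterating over it is just a Bash loop (a sketch; each entry is that dataset's evaluation output):
for f in $eval_results/*; do
  echo "$(basename $f)"   # the BeirDataset key
  grep -E "^$metric " $f
done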
Plans
We have now built a full pipeline for the BEIR-14 dataset. The pipeline is defined in `beir.hm`.
Let's first preview this in the command line:
hypermake beir.hm list
It shows a rendition of the DAG structure of our pipeline:
HyperMake 0.1.0 -- A parameterized pipeline manager
Workflow file: beir.hm
Variables:
• Metric: { ndcg_cut_10 recall_100 map mrr }
• BeirDataset: { msmarco scifact trec-covid webis-touche2020 fiqa dbpedia-entity fever nfcorpus hotpotqa climate-fever scidocs nq quora arguana }
Tasks:
• pyserini@local
│ • raw_beir_data[BeirDataset]
│ │ • trec_eval@local
├─┴─│─• beir_to_trec[BeirDataset]
├───│─┼─• index[BeirDataset]
└───│─┼─┴─• retrieve[BeirDataset]
└─┴───┴─• evaluate[BeirDataset]
└─• aggregate_metric[Metric]
To run them all:
hypermake beir.hm run "aggregate_metric[Metric: *]" -j8
Here we compute the `aggregate_metric` task for all metrics defined in `metric`, with at most 8 jobs running in parallel!
Or, we can define a plan to specify the targets we want to run:
plan RunBEIR = {
aggregate_metric[Metric: *]
}
A plan definition can contain multiple targets, separated by commas.
And invoke it with:
hypermake beir.hm run RunBEIR -j8
The results should match those in Table 2 (BM25 column) of the RepLlama paper.
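Since a plan can list several comma-separated targets, a hypothetical plan that also explicitly includes the per-dataset evaluations could be written as:
plan RunAll = {
  aggregate_metric[Metric: *],
  evaluate[BeirDataset: *]
}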
Tasks
Tasks are the atomic units in HyperMake pipelines. Each task is realized as a script (by default in Bash, but it can be in any language with the `std.run` decorator) with a set of inputs and outputs.
The task script is executed in a child process, and the inputs and outputs are passed as environment variables (HyperMake manages these variables). Its working directory is managed by HyperMake, and is located at `${fileSys.root}/$taskName/$taskParams`.
`$taskParams` is the percent-encoded string of the set of task parameters that are not default: e.g. `Dropout=0.1&Lr=0.01&BatchSize=32`.
Syntax
To define a task, write
task taskName@fsT(
$param1@fsI1=$arg1,
$param2@fsI2=$arg2,
...
) -> ($out1@fsO1, $out2@fsO2, ...):
# task script
where
- `taskName` is the name of the task.
- `fsT` is the file system in which the task is executed. If omitted, the task is executed in the `local` file system.
- `$param1`, `$param2`, ... are the parameters of the task.
- `$arg1`, `$arg2`, ... are the input arguments of the task, and they have to be specified. If not, the task is considered abstract, and it is a function.
- `$fsI1`, `$fsI2`, ... are the file systems in which the input arguments are expected to be. If an input is not already there, HyperMake will automatically transfer it to the specified file system. If omitted, they default to `$fsT`.
- `$out1`, `$out2`, ... are the output files of the task. They can be files or directories.
- `$fsO1`, `$fsO2`, ... are the file systems that the output files are in. If omitted, they default to `$fsT`.
Behavior
To run a task, HyperMake works by:
- Checking the cache: If the outputs of the task are already in the cache and successful, the task is considered up-to-date and is not run.
- Removing the outputs: If the outputs exist but are corrupted (e.g. the task did not finish successfully), the outputs are removed.
- Creating a working directory: HyperMake creates a working directory for the task, at `${fileSys.root}/$taskName/$taskParams`.
- Locking this directory: HyperMake locks this directory to prevent other HyperMake instances from running this task.
- Linking the inputs: HyperMake links the input files (outputs of other dependent tasks) into the working directory.
- Running the task script: HyperMake runs the task script in the working directory as a child process.
- Checking the outputs: The task is considered successfully terminated if the task script exits with a zero exit code and all outputs exist on their specified file systems.
- Unlocking the directory: HyperMake unlocks the working directory.
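As a rough sketch of the resulting on-disk layout for the BEIR tutorial (assuming the local root is `out`, as in the earlier chapters), running `raw_beir_data[BeirDataset: *]` produces working directories such as:
out/raw_beir_data/default                  # BeirDataset=msmarco, the default key
out/raw_beir_data/BeirDataset=scifact
out/raw_beir_data/BeirDataset=fiqa
...
Each of these working directories contains the task's declared output (here, the directory `out` created by the script).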
Functions
Functions in HyperMake are abstract tasks: tasks whose inputs are not fully specified. Functions allow for the instantiation of tasks with different parameters at different locations in the pipeline.
Syntax
def funcName(input1, input2, ...) -> (output1, output2, ...):
# function script
where
- `funcName` is the name of the function.
- `input1`, `input2`, ... are the input arguments of the function.
- `output1`, `output2`, ... are the output files of the function.
Instantiation
To instantiate a function as a task, write
task taskName($param1=$arg1, $param2=$arg2, ...) =
funcName(input1=$param1, input2=$param2, ...)
Instantiating a function as a task starts with `=`, not `:` (which starts a script block).
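As a hypothetical sketch (the function, task, and value names are made up), a small text-normalization step could be written once as a function and then instantiated as a task:
def lowercase(raw_text) -> out:
  tr '[:upper:]' '[:lower:]' < $raw_text > $out

task lowercase_corpus(text=$some_file) = lowercase(raw_text=$text)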
Module System
Objects
In HyperMake, it is sometimes necessary to bundle certain definitions together so that they can be reused. Such a bundle forms an object in HyperMake (think of it as a singleton object in OO languages):
object my_obj:
key = value
def f(...):
...
task t0(...) -> out = f(...)
Objects can be used as namespaces. To refer to a definition in an object, use the `.` sign. For example, to refer to
- the `key` in `my_obj`, write `$my_obj.key`;
- the `out` output in task `t0`, write `$my_obj.t0.out`.
Given a task `a.b.c`, it will be placed in `${fileSys.root}/a/b/c`.
Classes
Classes are just abstract objects that can be instantiated with parameters.
class my_class(param0, param1, ...):
key = value
task t0(...) -> out:
...
To instantiate a class, write
object my_obj = my_class(arg0, arg1, ...)
Note that instantiation of an object starts with the keyword `object`. Just doing `my_obj = ...` would define a string-valued literal.
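As a hypothetical sketch (all names made up, and assuming a pipeline-level value `data` exists), a class can bundle a parameterized step that is then instantiated once per configuration; following the object rules above, the output of the first instance would be referenced as `$en_tokenizer.tokenize.out`:
class tokenizer(lang):
  task tokenize(data=$) -> out:
    ...

object en_tokenizer = tokenizer(lang="en")
object de_tokenizer = tokenizer(lang="de")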
Modules
HyperMake's module system is based on objects: each file, when imported, forms a singleton object. Given the following directory structure:
main.hm
data/
preprocess.hm
In `main.hm`, you can import `preprocess.hm` as follows:
import data.preprocess
This will create an object `data.preprocess` that contains all definitions in `preprocess.hm`. Or you can import it under an alias:
import data.preprocess as pp
This will create an object `pp` that contains all definitions in `preprocess.hm`.
Additionally, you can import a HyperMake script in the current namespace, not as an object:
import "data/preprocess.hm"
Importing a file by its filename will import all definitions in `preprocess.hm` into the current namespace.
Decorators
In HyperMake, a task can be decorated with decorators, effectively modifying its behavior. This can support
- Running with different shell;
- Running in specific virtual environments;
- Running through some cluster submission systems;
- etc.
A decorator in HyperMake is just an object with a `run` method that takes a script as input and runs a modified version of it.
object decorator:
def run(internal_script):
...
If a decorator admits parameters, it simply becomes a class:
class decorator(args):
def run(internal_script):
...
and when applying a decorator, one could write
@decorator(args)
task taskName(...) -> out:
...
Example 1: A decorator that runs a task in Python
An example that lets us run a task in Python instead of a shell:
object python:
def run(internal_script):
python $internal_script
@python
task helloWorldInPython:
print("Hello World" + " " + "in Python!")
There is no need to define this in your pipelines: it is already available in the standard library as `@std.run(interpreter="python")`.
Example 2: A decorator that runs a script in a Conda virtual environment
In Python, a task can be run in different Conda virtual environments. Here is a decorator that lets us do that.
class conda(env):
def run(internal_conda_script):
eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
conda activate $env
. $internal_conda_script
conda deactivate
@conda(env={Env: base myenv})
task helloWorldFromEnv:
python -c "print('Hello World in Python from $env!')"
Note that in the task `helloWorldFromEnv`, the decorator `conda` has a parameterized argument: `env={Env: base myenv}`.
We can invoke both cases of the task `helloWorldFromEnv`:
hypermake tutorial/decorators.hm run 'helloWorldFromEnv[Env: *]'
We will see both lines
Hello World in Python from base!
Hello World in Python from myenv!
output to the terminal.
Example 3: Chaining decorators
We have now created two decorators:
- `@python`, which executes a script using Python instead of Bash as the interpreter;
- `@conda`, which runs a task in a specific Conda virtual environment.
Can we compose these decorators? Yes.
@conda(env={Env: base myenv})
@python
task helloWorldInPythonFromEnv:
import os
print(f"Hello World in Python from {os.environ['env']}!")
One can use `os.environ[var]` to get the environment variable `$var` in Python. First, our script is wrapped by `@python`, then by `@conda(env)`. Recall that HyperMake passes parameters into the script as environment variables: we cannot use `$env` to get the HyperMake variable in Python.
Example 4: A decorator that runs a compiled language: C
We can also create a decorator that runs a task in C. Since C is a compiled language, we need to compile the script first.
object gcc:
def run(internal_c_script):
ln -s $internal_c_script source.c
gcc source.c -o source.out
./source.out
Now we can do fun things: write C scripts in HyperMake!
@gcc
task print(input="abcde"):
#include <stdio.h>
#include <stdlib.h>
int main() {
char* input = getenv("input");
printf("%s\n", input);
return 0;
}
Packages
In HyperMake, packages are special tasks that build a software package. They can depend on other packages but not on tasks, and they are built separately on different environments (see the next tutorial).
A package is defined as follows (note that a package must have exactly 1 output):
package $packageName -> $packageOutputName:
# build script
For example, let's build `trec_eval` (a standard information retrieval evaluation toolkit from NIST) from its C source code:
package trec_eval -> out:
mkdir -p $out
git clone https://github.com/usnistgov/trec_eval $out
cd $out
make
Here we clone the repository into a HyperMake-managed directory `$out`, and then run `make` to build the package. The binary will be built in `$out`.
To refer to this package output, use `$trec_eval` (there is no need to specify `$trec_eval.out`).
For example, if an evaluation task requires this package, one can write
task eval(trec_eval=$, pred=$, gold=$) -> out:
$trec_eval/trec_eval $gold $pred > $out
Example 1: Copying a package from a local directory
package pack1 -> out:
ln -s $localDir $out
This behavior can be written as
import std
package pack1 = std.symlink(path=$localDir)
Example 2: Cloning a remote repository and building it
package pack2(repo=$) -> out:
git clone $repo out
cd out
make
Example 3: Creating a Conda environment from a Python package
package pack3(pythonPackage=$) -> out:
mkdir -p $out
conda env create -p $out -f $pythonPackage/environment.yml
File systems
A file system encapsulates the operations that can be performed on files and directories in a particular environment in HyperMake.
HyperMake provides a default file system implementation for the local file system (`local`), and has utilities to define file systems over common remote systems such as SFTP, AWS S3, and Azure Blob Storage.
Additionally, it is possible to define custom file systems for different environments.
In HyperMake, a file system is an object with various member functions defined.
Functions in a file system object
Member | Description |
---|---|
fs.root | A string specifying the root path of all HyperMake outputs. |
fs.read(file) | Reads the file $file and outputs the content to stdout . |
fs.mkdir(dir) | Creates an empty directory $dir . This should have the semantics of mkdir -p : it should create all parent directories if they do not exist, and it should not fail if the directory already exists. |
fs.exists(file) | Checks if $file exists in fs . |
fs.link(src, dst) | Creates a symbolic link at $dst that links to $src . |
fs.touch(file) | Creates an empty file at path $file. |
fs.remove(file) | Removes file $file in fs . If $file is a directory, it should remove the directory and all its contents. |
fs.upload(src, dst) | Uploads the file or directory $src in local to $dst in fs . |
fs.download(src, dst) | Downloads the file or directory $src in fs to $dst in local . |
fs.execute(command) | (Optional) Executes the command $command in fs 's shell. This can be omitted if the file system does not support running commands. |
There is no need to define `local` as it is internal to HyperMake. A reference implementation of `local` is provided below.
object local:
root = "."
def read(file):
cat $file
def mkdir(dir):
mkdir -p $dir
def exists(file):
test -e $file
def link(src, dst):
ln -s $src $dst
def touch(file):
touch $file
def remove(file):
rm -r $file
def upload(src, dst):
ln -s $src $dst # both local, so a symbolic link suffices
def download(src, dst):
ln -s $src $dst # both local, so a symbolic link suffices
def execute(command):
bash -e $command
Example: define a file system over SFTP
import ssh
object my_server = ssh.server(host="...")
Example: define a file system over AWS S3
import aws
object my_bucket = aws.s3(name="...")
Example: define a file system over Azure Blob Storage
import az
object my_container = az.storage_blob(name="...")
Transferring files between environments
Sometimes different parts of a pipeline are run under different environments, e.g., data preprocessing may happen on a local machine, whereas training is done on an SSH grid, or on AWS EC2 or Azure ML.
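For instance, here is a sketch (assuming an SSH file system `my_server` defined as in the example above, and a hypothetical `preprocess` task that runs locally): the training task is pinned to `my_server`, and HyperMake transfers the locally produced input there before running the script.
import ssh
object my_server = ssh.server(host="...", root="/home/user/out")

task train@my_server(data=$preprocess.out) -> out:
  # runs on my_server; $data (produced locally) is uploaded there automatically
  ...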
Module std
Contains some miscellaneous utilities for HyperMake.
Function std.symlink
Creates a symbolic link as an output. This is particularly useful when referring to a local repository that is under development.
import std
package my_repo = std.symlink(path="path/to/my/repo")
Class std.run
Enables a task in HyperMake to run in a custom interpreter (e.g. Python, Perl, etc.).
Example usage:
import std
sender = {Sender: Alice Bob}
@std.run(interpreter="python3")
task hello_world(sender=$):
import os
print(f"Hello, world from {os.environ["sender"]}!")
Note that whatever interpreter you choose to use, HyperMake parameters are passed into the task as environment variables. Here in Python we use `os.environ` to access them.
Module aws
aws.s3
Enables AWS S3 buckets to be used as a HyperMake file system. Behind the scenes it uses the `aws s3` CLI command family.
Example usage:
import aws
object my_bucket = aws.s3(
bucket="my_bucket",
root=""
)
Module az
Enables various decorators for Microsoft Azure services in HyperMake.
az.storage_blob
Enables Azure Blob Storage containers to be used as a file system in HyperMake. Behind the scenes it uses the `az storage blob` CLI command family.
Example usage:
import az
object az_storage = az.storage_blob(
container="my_container",
extra_args="--account-name xxx --account-key yyy"
)
data_path = "/path/to/data"@az_storage
az.storage_fs
Enables Azure Data Lake Storage (ADLS) Gen2 containers to be used as a file system in HyperMake. Behind the scenes it uses the `az storage fs` CLI command family.
az.ml_job_create
Enables Azure ML command jobs to be used as a submitter in HyperMake. Behind the scenes it uses the `az ml job` CLI command family.
Module conda
Enables Conda environments to be used as decorators in HyperMake.
Function conda.create_env
Creates a Conda environment based on a yaml specification file.
package env = conda.create_env(file="environment.yml")
Class conda.activate
Enables a job to be run within a Conda environment.
import conda
@conda.activate(environment="myenv")
task check_if_cuda_is_available():
python -c "import torch; print(torch.cuda.is_available())"
You can use the returned path of `conda.create_env` as the `environment` argument.
package env = conda.create_env(file="environment.yml")
@conda.activate(environment=$env)
This can even be expressed with nested decorators:
import std
import conda
@conda.activate(environment="myenv")
@std.run(interpreter="python")
task check_if_cuda_is_available():
import torch
print(torch.cuda.is_available())
Here we first wrap the script with the Python interpreter, then dictate that this task should run within a Conda environment.
Module ssh
Enables SSH servers to be used as file systems in HyperMake.
ssh.server
Defines an SSH server in HyperMake. Note that this file system is able to execute jobs.
Example:
import ssh
object my_server = ssh.server(
host='192.168.0.7', # host name, in ~/.ssh/config
root='/home/user/out' # root of HyperMake output on the remote server
)
task my_remote_task@my_server(input@my_server) -> output@my_server:
# This task will be executed on the remote server
# and the input will be copied to the remote server.
# The output is expected to appear on the remote server.
...