HyperMake

HyperMake is a parameterized pipeline definition language (think of make, but where tasks can be parameterized), heavily inspired by Ducttape.

  • Shell scripting: Write tasks in plain Bash, just like make. No Python or YAML for defining tasks.
  • Cached intermediate results: Intermediate results are cached, so that if a task fails, the pipeline can be re-run from the last successful step.
  • Parameterization of tasks: a task can have multiple versions (e.g. in an ML pipeline, a task can be run with different hyperparameters).
  • Minimal juggling: Inputs and outputs are just files/symlinks, and arguments are all passed as environment variables. Each task is realized as a child process.
  • Automatic parallelization: based on the dependency DAG.
  • Cloud-agnostic: tasks can be run locally, or on a cloud (e.g. AWS, Azure), with minimal changes to the pipeline.

Installation

Via Homebrew

HyperMake can be installed via brew with a custom tap.

brew tap ctongfei/repo
brew install --HEAD ctongfei/repo/hypermake

Right now the Homebrew formula is configured to use the HEAD version of HyperMake, i.e. the latest version on the main branch. To reinstall when a newer version lands, run

brew reinstall ctongfei/repo/hypermake

Building from source

HyperMake can also be directly built from source. It requires sbt to build and JDK 8+ to run.

git clone https://github.com/ctongfei/hypermake
cd hypermake

make
make install

This will build the HyperMake binary and install it locally to $HOME/.local/bin/hypermake. To install it elsewhere, simply modify the $PREFIX variable in the Makefile.

In the following sections we will first get a glimpse of HyperMake by running a simple "Hello, world" task.

Then, we will gradually introduce more advanced features of HyperMake to build a pipeline for running the BEIR (paper) benchmark.¹

¹ BEIR is a robust and heterogeneous evaluation benchmark for zero-shot information retrieval. It includes a diverse set of retrieval tasks, such as web search, question answering, and entity retrieval. The benchmark is designed to evaluate the generalization capabilities of retrieval models across different tasks and domains.

Hello, world!

To introduce HyperMake, let's define our first task:

task hello:
  echo "Hello, world!"

Save this file as hello.hm.

We have created our first HyperMake script file that contains a single task. This script defines a task named hello that prints Hello, world! to the console. There is no input or output for this task.

Note the syntax here: A code block starts after the : at the end of the task signature. A code block is a consecutive run of indented script lines, where each line must start with at least 2 spaces. By default, the script is written in Bash.

If you are familiar with make, you can think of a task as a make rule. The task above written as a Makefile would just be

hello:
    echo "Hello, world!"

.PHONY: hello  # since this task does not produce any output

Now let's run this task!

Execute the following command in your shell:

  hypermake hello.hm run hello 

We should see the output "Hello, world!" printed in the terminal.

The basic command-line usage is hypermake $script <subcommand> $target. Here the <subcommand> is simply run.

Parameters

For the following tutorial sections, we will gradually build our beir.hm pipeline.

In any ML pipeline, one starts with data. In this example, we will use the BEIR-14 subset (those with public licenses) of the BEIR benchmark.

We declare these datasets as a pipeline parameter BeirDataset:

beir_dataset = {BeirDataset: 
    msmarco scifact trec-covid webis-touche2020 
    fiqa dbpedia-entity fever nfcorpus hotpotqa 
    climate-fever scidocs nq quora arguana
}

Note the syntax for declaring a parameter: {ParamName: key0 key1 ...}. For each parameter, the first key is considered the default case -- here msmarco.

Now we want to download the raw data from their official location.

beir_url_prefix = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"

We proceed to write the first task of our pipeline: downloading the raw data and unzipping it.

task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
  wget -O dataset.zip $beir_url_prefix/$beir_dataset.zip
  unzip dataset.zip
  rm dataset.zip
  mv $beir_dataset out

We declared a task that takes two inputs: beir_dataset and beir_url_prefix, and produces an output directory out.

The syntax name=$ is a shorthand for name=$name. Here, beir_dataset=$ introduces the beir_dataset parameter as an input to the task.
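
For illustration, the signature above with the shorthand expanded reads:

task raw_beir_data(beir_dataset=$beir_dataset, beir_url_prefix=$beir_url_prefix) -> out:
  ...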

The task is considered complete when its output directory out exists after the task exits with a zero status code.

In HyperMake, the success of a task is determined by

  • the existence of all of its specified outputs
  • AND the zero exit status code of the task script.

Note that

  • beir_dataset is a parameterized value: it can take any of the values in the BeirDataset parameter;
  • beir_url_prefix is a singleton (non-parameterized) value.

Hence the task raw_beir_data is parameterized by all the parameters in its inputs. Think of raw_beir_data not as a single task, but as a tensor of tasks with 1 dimension: the BeirDataset parameter.
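
For instance (a hypothetical sketch, not part of the BEIR pipeline), adding a second parameterized input turns a task into a 2-dimensional tensor of tasks:

bm25_k1 = {K1: 0.9 1.2}

task search(beir_dataset=$, bm25_k1=$) -> out:
  # one task instance per (BeirDataset, K1) combination
  ...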

Invoking the task

Let's invoke this task. To download the msmarco dataset, we run:

hypermake beir.hm run "raw_beir_data[BeirDataset:msmarco]"

We will find the downloaded dataset in the out/raw_beir_data/default directory.

Note the task indexing syntax: task[Param0: key0, Param1: key1, ...].

Download another one:

hypermake beir.hm run "raw_beir_data[BeirDataset:scifact]"

We will find the downloaded dataset in the out/raw_beir_data/BeirDataset=scifact directory.

The output directory will be out/<task-name>/<non-default-params>, where <non-default-params> is the URL percent-encoding of the key-value pairs.
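
As an illustrative sketch (the task and parameter names here are hypothetical), a task parameterized by two parameters would lay out its outputs like this:

out/train/default                  (all parameters at their default keys)
out/train/Lr=0.01                  (one non-default key)
out/train/Dropout=0.1&Lr=0.01      (two non-default keys, percent-encoded and joined)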

Clearly we do not wish to invoke the download for each dataset one by one. Let's use the wildcard *:

hypermake beir.hm run "raw_beir_data[BeirDataset: *]" -j8

The -j flag specifies the number of parallel jobs to run. Here we run 8 jobs in parallel.

At this point we have downloaded all the datasets. We will proceed to the next steps in the following sections.

Task composition

In the previous section, we saw how to define a task with parameters. In this section, we will see how to compose tasks together to form a pipeline.

To run the BEIR benchmark with the Pyserini package, we envision the following pipeline:

graph LR;
raw_beir_data --> beir_to_trec;
beir_to_trec --> index;
index --> retrieve;
beir_to_trec --> retrieve;
retrieve --> evaluate;
beir_to_trec --> evaluate;

Essentially, our pipeline consists of the following steps:

  • Download the raw data in raw_beir_data;
  • Preprocess the data to the standard TREC format in beir_to_trec;
  • Index the data in index to create a BM25 index (depending on the preprocessed data);
  • Retrieve the top-100 documents for each query in retrieve (depending on the index and the data);
  • Evaluate the retrieval results in evaluate (depending on the data and the retrieved results).

To compose these tasks, we need to define the dependencies between them. This is done by specifying the output of one task as the input of another task.

task raw_beir_data(beir_dataset=$, beir_url_prefix=$) -> out:
  ...
  
task beir_to_trec(data=$raw_beir_data.out) -> out:
  ...

task index(data=$beir_to_trec.out) -> out:
  ...

task retrieve(data=$beir_to_trec.out, index=$index.out) -> out:
  ...

task evaluate(data=$beir_to_trec.out, result=$retrieve.out) -> out:
  ...

In the next sections we will implement these tasks one by one.

Packages

Our previous sketch requires some packages to be built.

  • A conda package that contains a bunch of Python libraries (mainly Pyserini) to run BM25 search;
  • The NIST trec_eval package to evaluate the retrieval results.

We will define these packages in HyperMake and let them be part of the whole pipeline, so when a user runs the pipeline, the packages will be built and installed automatically.

A package in HyperMake is defined with the package keyword, and it is a special kind of task.

Creating a Conda package

package pyserini -> out:
  mkdir -p $out
  conda create -y \
    -p $out \
    -c conda-forge \
    python=3.10 openjdk=21
  $out/bin/pip install torch faiss-cpu pyserini

We declared a package named pyserini that, when built, creates a new Conda environment with Python 3.10 and OpenJDK 21, and installs Pyserini in it. Note that we build the package in a separate, HyperMake-managed directory $out (via Conda's -p/--prefix option) instead of in a global Conda environment.

This pattern is so common that HyperMake provides standard-library subroutines to make it easier:

import conda
package pyserini = conda.create(
    packages="python=3.10 openjdk=21",
    extra_args="-c conda-forge",
    extra_pip_packages="torch faiss-cpu pyserini"
)

Building the trec_eval package

trec_eval is a C package built with Make. We can define a package for it as well:

package trec_eval -> out:
  git clone https://github.com/usnistgov/trec_eval.git $out
  cd $out
  make

In HyperMake, a package must have exactly 1 output: the built package directory. To refer to the output directory, directly use the package name as a variable (e.g. $pyserini, $trec_eval here).

What exactly is the difference between a package and a task? When a HyperMake pipeline spans multiple file systems (e.g. local, AWS, SSH), a task is run only once and its outputs are transferred between file systems, while a package is built separately on each file system.

Decorators

Now let's convert our downloaded BEIR data to the standard TREC format. This format is standard for information retrieval tasks. There are 3 kinds of files in the TREC format:

  • *.queries, a TSV file with two columns, query id and query text;
  • *.qrels, a TSV file with four columns, query id, iteration id, document id, and relevance label;
  • corpus, a TSV file with two columns, document id and document text.

This conversion involves some complex processing, so we will first write a Python script beir_to_trec.py to do this.

import os
import json
import sys
import csv
from tqdm import tqdm
data = sys.argv[1]  # The directory containing the downloaded BEIR data
out = sys.argv[2]  # The directory to write the TREC format data
os.mkdir(out)
with open(f"{data}/corpus.jsonl") as f_in, open(f"{out}/corpus", 'w') as f_out:
  for line in tqdm(f_in):
    obj = json.loads(line)
    id = obj['_id']
    text = obj['text'].replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    title = obj.get('title', "")
    trec_line = f"{id}\t{title}: {text}" if title != "" else f"{id}\t{text}" 
    # Concatenate title and text
    print(trec_line, file=f_out)
queries = {}
with open(f"{data}/queries.jsonl") as f:
  for line in tqdm(f):
    obj = json.loads(line)
    id = obj['_id']
    text = obj['text']
    queries[id] = text
for partition in os.listdir(f"{data}/qrels"):
  partition = os.path.splitext(partition)[0]
  with open(f"{data}/qrels/{partition}.tsv") as f_in, open(f"{out}/{partition}.qrels", 'w') as f_out:
    query_ids = set()
    for row in tqdm(csv.DictReader(f_in, delimiter='\t')):
      query_ids.add(row['query-id'])
      print(f"{row['query-id']}\t0\t{row['corpus-id']}\t{row['score']}", file=f_out)
  with open(f"{out}/{partition}.queries", 'w') as f:
    for query_id in query_ids:
      print(f"{query_id}\t{queries[query_id]}", file=f)

Now we can write a task to run this script.

task beir_to_trec(data=$raw_beir_data.out) -> out:
  python beir_to_trec.py $data out

This task takes the output of the raw_beir_data task as input and produces a directory out containing the TREC format data.

But to run this task, before invoking hypermake from the command line, we need to first activate the Conda environment that contains the Python dependencies required by the script beir_to_trec.py. This is not ideal -- recall that we just built the pyserini Conda environment in the previous section. We would like to run this task in the pyserini environment.

Let's decorate this task with a @conda decorator that activates the pyserini environment.

import conda

@conda.activate(environment=$pyserini)
task beir_to_trec(data=$raw_beir_data.out) -> out:
    python beir_to_trec.py $data out

What is the magic behind this decorator? A HyperMake decorator takes a script and returns a new, wrapped script. To implement your own decorator, you need an object with a run function. If you are curious, you can find the implementation of the conda.activate decorator in the HyperMake standard library.
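
As a rough sketch (not the actual standard-library source), such a decorator could be a class whose run function activates the given environment, sources the wrapped script, and deactivates the environment afterwards:

class activate(environment):
  def run(internal_script):
    # hook Conda into this shell, then run the wrapped script inside $environment
    eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
    conda activate $environment
    . $internal_script
    conda deactivate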

Next steps

Let's continue building our pipeline, starting from indexing the corpus with Pyserini.

At this step, we run a Bash script under the Pyserini conda environment.

@conda.activate(environment=$pyserini)
task index(data=$beir_to_trec.out) -> out:
  mkdir corpus
  cat $data/corpus \
    | jq -nRc 'inputs | split("\t") | {id: .[0], contents: .[1]}' \
    > corpus/corpus.json  # Convert TREC format to Pyserini JSON
  python -m pyserini.index.lucene \
    --collection JsonCollection \
    --input corpus \
    --index $out \
    --generator DefaultLuceneDocumentGenerator \
    --threads $(nproc) \
    --storePositions \
    --storeDocvectors \
    --storeRaw

Run the actual retrieval with Pyserini.

@conda.activate(environment=$pyserini)
task retrieve(
  data=$beir_to_trec.out, 
  test_partition=$, 
  index=$index.out
) -> (out="result.qres"):
  ln -s $data/$test_partition.queries test.tsv
  python -m pyserini.search.lucene \
    --index $index \
    --topics test.tsv \
    --output $out \
    --batch-size 32 \
    --hits 100 \
    --threads $(nproc) \
    --remove-duplicates --remove-query --bm25

Evaluate the retrieval results with trec_eval.

task evaluate(
  data=$beir_to_trec.out,
  result=$retrieve.out,
  test_partition=$,
  trec_eval=$
) -> (out="eval.txt"):
  $trec_eval/trec_eval -m all_trec $data/$test_partition.qrels $result > $out

Here we referenced the output of the trec_eval package as $trec_eval. This is because the trec_eval package is a separate package that we built in the previous section. We can refer to the output of a package directly by its name.

Reduction

At this point we have a pipeline that retrieves documents from a corpus using Pyserini. We have also evaluated the retrieval results with trec_eval. But there are a lot of runs: one for each dataset in BEIR-14. We would like to aggregate the evaluation results.

This is done by reduction in HyperMake: think of this as reduce in functional programming, or .max(dim=i) in a tensor-processing library.

Recall that at the end of the pipeline built thus far, the trec_eval output file eval.txt for each dataset is available under the variable $evaluate.out. We would like to aggregate these results over all datasets in BEIR-14. Additionally, there is more than one metric that we care about: for example, ndcg_cut_10, recall_100, map, and mrr.

We can define a new task aggregate_metric that takes the evaluation results of all datasets and aggregates them. The task definition is as follows:

metric = {Metric: ndcg_cut_10 recall_100 map mrr}

task aggregate_metric(
  eval_results=$evaluate[BeirDataset: *].out, 
  metric=$
) -> (out="aggregate.txt"):
  grep -E "^$metric " $eval_results/* > $out

Note here that we used $evaluate[BeirDataset: *].out to refer to the outputs of the evaluate task for every dataset in BEIR-14. The parameter eval_results, while logically a dictionary from configurations to files, is realized as a folder of files for the shell script.

HyperMake maps such a dictionary to a folder of files in the shell script. This is a common pattern in HyperMake for handling multiple outputs; see the sketch after the list below.

  • dict[key] would be $dict/$key in the shell.
  • dict.values() would be $dict/* in the shell.
  • for key in dict would be for f in $dict/* in the shell.
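
For example, the body of aggregate_metric could iterate over the realized folder (a sketch using the same $eval_results, $metric, and $out variables as above):

# each file under $eval_results is the eval.txt of one BeirDataset key
for f in $eval_results/*; do
  grep -E "^$metric " $f
done > $out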

Plans

We have now built a full pipeline for the BEIR-14 dataset. The pipeline is defined in beir.hm.

Let's first preview this in the command line:

hypermake beir.hm list

It shows a rendition of the DAG structure of our pipeline:

HyperMake 0.1.0 -- A parameterized pipeline manager
Workflow file: beir.hm

Variables:
  • Metric: { ndcg_cut_10 recall_100 map mrr }
  • BeirDataset: { msmarco scifact trec-covid webis-touche2020 fiqa dbpedia-entity fever nfcorpus hotpotqa climate-fever scidocs nq quora arguana }

Tasks:
  • pyserini@local
  │ • raw_beir_data[BeirDataset]
  │ │ • trec_eval@local
  ├─┴─│─• beir_to_trec[BeirDataset]
  ├───│─┼─• index[BeirDataset]
  └───│─┼─┴─• retrieve[BeirDataset]
      └─┴───┴─• evaluate[BeirDataset]
              └─• aggregate_metric[Metric]

To run them all:

hypermake beir.hm run "aggregate_metric[Metric: *]" -j8

Here we compute the aggregate_metric task for all metrics defined in metric, with max 8 jobs running in parallel!

Or, we can define a plan that specifies the targets we want to run:

plan RunBEIR = {
    aggregate_metric[Metric: *]
}

A plan definition can contain multiple targets, separated by commas.
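
For instance, a plan with more than one target could look like this (the second target is only illustrative):

plan RunBEIR = {
    aggregate_metric[Metric: *],
    evaluate[BeirDataset: *]
}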

And invoke it with:

hypermake beir.hm run RunBEIR -j8

The results should match the results in Table 2 (BM25 column) of the RepLlama paper.

Tasks

Tasks are the atomic unit in HyperMake pipelines. Each task is realized as a script (by default in Bash, but it can be in any language with the std.run decorator) with a set of inputs and outputs.

The task script is executed in a child process, and the inputs and outputs are passed as environment variables (HyperMake manages these variables). Its working directory is managed by HyperMake, and is located at ${fileSys.root}/$taskName/$taskParams.

$taskParams is the percent-encoded string of the set of task parameters that are not default: e.g. Dropout=0.1&Lr=0.01&BatchSize=32.

Syntax

To define a task, write

task taskName@fsT(
  $param1@fsI1=$arg1, 
  $param2@fsI2=$arg2, 
  ...
) -> ($out1@fsO1, $out2@fsO2, ...):
  # task script

where

  • taskName is the name of the task.
  • fsT is the file system in which the task is executed. If omitted, the task is executed in the local file system.
  • $param1, $param2, ... are the parameters of the task.
  • $arg1, $arg2, ... are the input arguments of the task, and have to be specified.
    • If they are not, the task is considered abstract: it is a function.
  • $fsI1, $fsI2, ... are the file systems in which the input arguments are expected to reside. If an input is not already there, HyperMake will automatically transfer it to the specified file system. If omitted, they default to $fsT.
  • $out1, $out2, ... are the output files of the task.
    • Can be files or directories.
  • $fsO1, $fsO2, ... are the file systems in which the output files reside. If omitted, they default to $fsT.
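
Putting this together, a hedged sketch of a task that runs on a remote file system might read as follows (my_server is a hypothetical SSH file system; preprocess, lr, and model are illustrative names):

task train@my_server(data@my_server=$preprocess.out, lr=$) -> (model@my_server):
  # runs on my_server; $data is transferred there first if necessary
  ...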

Behavior

To run a task, HyperMake works by

  • Checking the cache: If the outputs of the task are already in the cache and the previous run succeeded, the task is considered up-to-date and is not run.
  • Removing stale outputs: If the outputs exist but are corrupted (e.g. the task did not finish successfully), they are removed.
  • Creating a working directory: HyperMake creates a working directory for the task, at ${fileSys.root}/$taskName/$taskParams.
  • Locking the directory: HyperMake locks the working directory to prevent other HyperMake instances from running the same task.
  • Linking the inputs: HyperMake links the input files (outputs of other dependent tasks) to the working directory.
  • Running the task script: HyperMake runs the task script in the working directory as a child process.
  • Checking the outputs: The task is considered successfully terminated if the task script exits with a zero exit code and all outputs exist in their specified file systems.
  • Unlocking the directory: HyperMake unlocks the working directory.

Functions

Functions in HyperMake are abstract tasks: tasks whose inputs are not fully specified. Functions allow for the instantiation of tasks with different parameters at different locations in the pipeline.

Syntax

def funcName(input1, input2, ...) -> (output1, output2, ...):
    # function script

where

  • funcName is the name of the function.
  • input1, input2, ... are the input arguments of the function.
  • output1, output2, ... are the output files of the function.

Instantiation

To instantiate a function as a task, write

task taskName($param1=$arg1, $param2=$arg2, ...) = 
  funcName(input1=$param1, input2=$param2, ...)

Instantiating a function as a task starts with =, not : (which would start a script block).
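
As an illustrative sketch (gzip_dir and archive_raw are made-up names), a function can be defined once and then instantiated as a concrete task:

def gzip_dir(dir) -> out:
  tar -czf $out -C $dir .

task archive_raw(data=$raw_beir_data.out) =
  gzip_dir(dir=$data)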

Module System

Objects

In HyperMake, it is sometimes necessary to bundle certain definitions together so that they can be reused. Such a bundle forms an object in HyperMake (think of it as a singleton object in OO languages):

object my_obj:
  key = value
  def f(...):
    ...
  
  task t0(...) -> out = f(...)

Objects can be used as namespaces. To refer to a definition in an object, use the . operator. For example, to refer to

  • the key in my_obj, write $my_obj.key
  • the out output in task t0, write $my_obj.t0.out.

Given a task a.b.c, it will be placed in ${fileSys.root}/a/b/c.
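
For example (a sketch reusing the my_obj object above), another task can reference these members directly:

task use_my_obj(k=$my_obj.key, data=$my_obj.t0.out) -> out:
  echo "$k" > $out
  ls $data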

Classes

Classes are just abstract objects that can be instantiated with parameters.

class my_class(param0, param1, ...):
  key = value
  task t0(...) -> out:
    ...

To instantiate a class, write

object my_obj = my_class(arg0, arg1, ...)

Note that instantiation of an object starts with the keyword object. Just doing my_obj = ... would define a string-valued literal.
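
A concrete sketch (greeter and its argument are made up, and this assumes a class parameter is visible as a value inside the class body):

class greeter(name):
  task greet(name=$) -> (out="greeting.txt"):
    echo "Hello, $name!" > $out

object alice = greeter(name="Alice")
object bob = greeter(name="Bob")

Their outputs can then be referenced as $alice.greet.out and $bob.greet.out.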

Modules

HyperMake's module system is based on objects: each file, when imported, forms a singleton object. Given the following directory structure:

  • main.hm
  • data/
    • preprocess.hm

In main.hm, you can import preprocess.hm as follows:

import data.preprocess

This will create an object data.preprocess that contains all definitions in preprocess.hm. Or you can import it as an alias:

import data.preprocess as pp

This will create an object pp that contains all definitions in preprocess.hm.

Additionally, you can import a HyperMake script in the current namespace, not as an object:

import "data/preprocess.hm"

Importing a file by its filename will import all definitions in preprocess.hm into the current namespace.

Decorators

In HyperMake, a task can be decorated with decorators that modify its behavior. This can support

  • Running with a different shell;
  • Running in specific virtual environments;
  • Running through some cluster submission systems;
  • etc.

A decorator in HyperMake is just an object with a run method that takes a script as input and runs a modified version.

object decorator:
    def run(internal_script):
        ...

If a decorator admits parameters, it simply becomes a class:

class decorator(args):
    def run(internal_script):
        ...

and when applying a decorator, one could write

@decorator(args)
task taskName(...) -> out:
  ...

Example 1: A decorator that runs a task in Python

An example that lets us run a task in Python instead of the shell:

object python:
  def run(internal_script):
    python $internal_script

@python
task helloWorldInPython:
  print("Hello World" + " " + "in Python!")

There is no need to define this in your pipelines: it is already available in the standard library as @std.run(interpreter="python").

Example 2: Decorates a script to run in a Conda virtual environment

A Python task can be run in different Conda virtual environments. Here is a decorator that lets us do that.

class conda(env):
  def run(internal_conda_script):
    eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
    conda activate $env
    . $internal_conda_script
    conda deactivate

@conda(env={Env: base myenv})
task helloWorldFromEnv:
  python -c "print('Hello World in Python from $env!')"

Note that in the task helloWorldFromEnv, the decorator conda has a parameterized argument: env={Env: base myenv}. We can invoke both cases of the task helloWorldFromEnv:

hypermake tutorial/decorators.hm run 'helloWorldFromEnv[Env: *]'

We will see both lines

Hello World in Python from base!
Hello World in Python from myenv!

output to the terminal.

Example 3: Chaining decorators

We have now created two decorators:

  • @python that executes a script using Python instead of Bash as the interpreter;
  • @conda that runs a task in a specific Conda virtual environment.

Can we compose these decorators? Yes.

@conda(env={Env: base myenv})
@python
task helloWorldInPythonFromEnv:
  import os
  print(f"Hello World in Python from {os.environ['env']}!")

Here our script is first wrapped by @python, then by @conda(env). Recall that HyperMake passes parameters into the script as environment variables: since the script is now Python rather than Bash, we cannot use $env to read the HyperMake variable; instead, we use os.environ[var] to get the environment variable $var.

Example 4: A decorator that runs a compiled language: C

We can also create a decorator that runs a task in C. Since C is a compiled language, we need to compile the script first.

object gcc:
  def run(internal_c_script):
    ln -s $internal_c_script source.c
    gcc source.c -o source.out
    ./source.out

Now we can do fun things: write C scripts in HyperMake!

@gcc
task print(input="abcde"):
  #include <stdio.h>
  #include <stdlib.h>
  int main() {
    char* input = getenv("input");
    printf("%s\n", input);
    return 0;
  }

Packages

In HyperMake, packages are special tasks that build a software package. They can depend on other packages but not on tasks, and will be built separately on different environments (see the next tutorial).

A package is defined as follows (note that a package can only have exactly 1 output):

package $packageName -> $packageOutputName:
  # build script

For example, let's build trec_eval (a standard information retrieval evaluation toolkit from NIST) from its C source code:

package trec_eval -> out:
  mkdir -p $out
  git clone https://github.com/usnistgov/trec_eval $out
  cd $out
  make

Here we clone the repository into a HyperMake-managed directory $out, and then run make to build the package. The binary will be built in $out.

To refer to this package output, use $trec_eval (there is no need to specify $trec_eval.out). For example, if an evaluation task requires this package, one can write

task eval(trec_eval=$, pred=$, gold=$) -> out:
  $trec_eval/trec_eval $gold $pred > $out

Example 1: Copying a package from a local directory

package pack1 -> out:
  ln -s $localDir $out

This behavior can be written as

import std
package pack1 = std.symlink(path=$localDir)

Example 2: Cloning from a remote repository and building it

package pack2(repo=$) -> out:
  git clone $repo out
  cd out
  make

Example 3: Creating a Conda environment from a Python package

package pack3(pythonPackage=$) -> out:
  mkdir -p $out
  conda env create -p $out -f $pythonPackage/environment.yml

File systems

A file system encapsulates the operations that can be performed on files and directories in a particular environment in HyperMake.

HyperMake provides a default file system implementation for the local file system (local), and has utilities to define file systems over common remote systems such as SFTP, AWS S3, and Azure Blob Storage.

Additionally, it is possible to define custom file systems for different environments.

In HyperMake, a file system is an object with various member functions defined.

Functions in a file system object

  • fs.root: A string specifying the root path of all HyperMake outputs.
  • fs.read(file): Reads the file $file and outputs the content to stdout.
  • fs.mkdir(dir): Creates an empty directory $dir. This should have the semantics of mkdir -p: it should create all parent directories if they do not exist, and it should not fail if the directory already exists.
  • fs.exists(file): Checks if $file exists in fs.
  • fs.link(src, dst): Creates a symbolic link at $dst that links to $src.
  • fs.touch(file): Creates an empty file at path $file.
  • fs.remove(file): Removes the file $file in fs. If $file is a directory, it should remove the directory and all its contents.
  • fs.upload(src, dst): Uploads the file or directory $src in local to $dst in fs.
  • fs.download(src, dst): Downloads the file or directory $src in fs to $dst in local.
  • fs.execute(command): (Optional) Executes the command $command in fs's shell. This can be omitted if the file system does not support running commands.

There is no need to define local as it is internal to HyperMake. A reference implementation of local is provided below.

object local:
    root = "."
    
    def read(file):
        cat $file
    
    def mkdir(dir):
        mkdir -p $dir
    
    def exists(file):
        test -e $file
    
    def link(src, dst):
        ln -s $src $dst
    
    def touch(file):
        touch $file
    
    def remove(file):
        rm -r $file
       
    def upload(src, dst):
        ln -s $src $dst  # both local, so a symbolic link suffices
    
    def download(src, dst):
        ln -s $src $dst  # both local, so a symbolic link suffices
        
    def execute(command):
        bash -e $command

Example: define a file system over SFTP

import ssh
object my_server = ssh.server(host="...")

Example: define a file system over AWS S3

import aws
object my_bucket = aws.s3(name="...")

Example: define a file system over Azure Blob Storage

import az
object my_container = az.storage_blob(name="...")

Transferring files between environments

Sometimes different parts of a pipeline are run under different environments, e.g., data preprocessing may happen on a local machine, whereas training is done on an SSH grid, or on AWS EC2 or Azure ML.
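
A hedged sketch of such a split pipeline (my_server as in the SFTP example above; the task and variable names are illustrative): the preprocessing task runs locally, while training runs on the server, and HyperMake transfers the preprocessed data there because the input is annotated with @my_server.

import ssh
object my_server = ssh.server(host="...", root="/data/hypermake")

task preprocess(raw=$) -> out:
  # runs on the local file system
  ...

task train@my_server(data@my_server=$preprocess.out) -> model:
  # runs on my_server; $data is uploaded there automatically
  ...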

Module std

Contains some miscellaneous utilities for HyperMake.

Function std.symlink

Creates a symbolic link as an output. This is particularly useful when referring to a local repository that is under development.

import std
package my_repo = std.symlink(path="path/to/my/repo")

Class std.run

Enables a task in HyperMake to run in a custom interpreter (e.g. Python, Perl, etc.).

Example usage:

import std

sender = {Sender: Alice Bob}

@std.run(interpreter="python3")
task hello_world(sender=$):
    import os
    print(f"Hello, world from {os.environ["sender"]}!")

Note that whatever interpreter you choose to use, HyperMake parameters are passed into the task as environment variables. Here in Python we use os.environ to access them.

Module aws

aws.s3

Enables AWS S3 buckets as a HyperMake file system. Behind the scenes it uses the aws s3 CLI command family.

Example usage:

import aws
object my_bucket = aws.s3(
    bucket="my_bucket",
    root=""
)

Module az

Provides support for various Microsoft Azure services in HyperMake.

az.storage_blob

Enables Azure Blob Storage containers to be used as a file system in HyperMake. Behind the scenes it uses the az storage blob CLI command family.

Example usage:

import az 
object az_storage = az.storage_blob( 
    container="my_container", 
    extra_args="--account-name xxx --account-key yyy"
)

data_path = "/path/to/data"@az_storage

az.storage_fs

Enables Azure Data Lake Storage (ADLS) Gen2 containers to be used as a file system in HyperMake. Behind the scenes it uses the az storage fs CLI command family.

az.ml_job_create

Enables Azure ML command jobs as a submitter in HyperMake. Behind the scenes it uses the az ml job CLI command family.

Module conda

Enables Conda environments to be used as decorators in HyperMake.

Function conda.create_env

Creates a Conda environment based on a yaml specification file.

package env = conda.create_env(file="environment.yml")

Class conda.activate

Enables a job to be run within a Conda environment.

import conda

@conda.activate(environment="myenv")
task check_if_cuda_is_available():
    python -c "import torch; print(torch.cuda.is_available())"

You can use the returned path of conda.create_env as the environment argument.

package env = conda.create_env(file="environment.yml")
@conda.activate(environment=$env)

This can even be combined with other decorators:

import std
import conda

@conda.activate(environment="myenv")
@std.run(interpreter="python")
task check_if_cuda_is_available():
    import torch
    print(torch.cuda.is_available())

Here we first wrap the script with a python interpreter, then dictate that this task should run within a Conda environment.

Module ssh

Enables SSH servers to be used as file systems in HyperMake.

ssh.server

Defines an SSH server in HyperMake. Note that this file system is able to execute jobs.

Example:

import ssh
object my_server = ssh.server(
    host='192.168.0.7',    # host name, in ~/.ssh/config
    root='/home/user/out'  # root of HyperMake output on the remote server
)

task my_remote_task@my_server(input@my_server) -> output@my_server:
    # This task will be executed on the remote server
    # and the input will be copied to the remote server.
    # The output is expected to appear on the remote server.
    ...