Workflow Management

Motivation

Running one calculation usually isn't too hard. Running a dozen usually isn't too bad either. But once you need to run hundreds or even thousands of calculations in parallel, things become complex very quickly. For instance, you might want to monitor job progress, relaunch failed jobs, ensure that certain steps of a workflow don't run before a prior step completes, automatically store the results in a database, and more. All of this calls for a workflow management system.
There are over 300 workflow management tools out there. See the Workflows Community Initiative and the "Existing Workflow Systems" resource if you're curious.

Our Needs

Our needs when it comes to a workflow management system are as follows:
It can't rely on the compute nodes having network connectivity, as Princeton HPC doesn't support this.
It must support popular HPC job scheduling systems, particularly Slurm.
It must support dynamic workflows, where jobs can spawn other jobs.
It must support the pilot job model so that long queue times can be avoided.
It should be Pythonic, broadly defined.
It should not require substantial changes to how you would normally write the underlying functions. In other words, it should be relatively simple to go from a standard function to a high-throughput workflow.
It should have support for job monitoring, tracking errors, and re-dispatching any failed workflows.
It should easily support distributed computing environments.
It should be actively developed and maintained.
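To make the "standard function to high-throughput workflow" requirement concrete, here is a minimal sketch using only the Python standard library (no workflow engine). The function names relax and static are hypothetical placeholders for real calculations; the point is that the underlying functions stay unchanged while a thin layer dispatches them concurrently and enforces that each static step waits for its relaxation step.

```python
from concurrent.futures import ThreadPoolExecutor

# Plain functions, written exactly as you normally would.
# (Hypothetical stand-ins for real calculations.)
def relax(structure: str) -> str:
    return f"relaxed({structure})"

def static(structure: str) -> str:
    return f"static({structure})"

# Conceptual sketch: run many independent relaxations concurrently,
# then chain a static calculation onto each result so it never starts
# before the prior step completes.
def run_campaign(structures):
    with ThreadPoolExecutor() as pool:
        relaxed = pool.map(relax, structures)
        return list(pool.map(static, relaxed))

results = run_campaign(["Cu", "Al"])
```

A real workflow engine replaces the thread pool with Slurm-aware executors and adds monitoring, retries, and result storage, but the shape of the user code stays about this simple.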

Our Workflow Infrastructure

With the above information in mind, our group currently uses Prefect for small-to-moderate scale campaigns and Parsl for massively parallel campaigns. Prefect benefits from having substantial observability into running calculations via its GUI but has some limitations in terms of packing multiple concurrent MPI calculations into a single Slurm allocation. Parsl is custom-made for HPC and has an assortment of features to ensure that jobs make it through the queue quickly, although it lacks the insightful job monitoring/re-dispatching features that Prefect has.
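The pilot job model mentioned above is what lets engines like Parsl move jobs through the queue quickly: instead of submitting every small task as its own Slurm job, one long-lived allocation ("the pilot") hosts a pool of workers that pull tasks from a shared queue. Here is a minimal stdlib sketch of that idea, with a squaring operation standing in for a real calculation; it is a conceptual illustration, not Parsl's actual implementation.

```python
import queue
import threading

def run_pilot(tasks, n_workers=4):
    # The shared task queue that the pilot's workers drain.
    work = queue.Queue()
    for t in tasks:
        work.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        # Each worker keeps pulling tasks until the queue is empty,
        # amortizing one queue wait across many small tasks.
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return
            r = t * t  # placeholder for a real calculation
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sorted(results)

print(run_pilot(range(6)))
```

In an HPC setting, the worker pool would live inside a single Slurm allocation, so only the pilot itself ever waits in the batch queue.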

Dispatching Jobs

In practice, you will probably want to run your calculations on one or more HPC machines. Refer to the "Deploying Calculations" section of the quacc documentation for how to do so.