JFR Setup on Princeton Machines

Tiger

Regular Submission

The following is a representative setup for Jobflow Remote, specifically for use on our tiger-arrk login node.
    Run pip install jobflow-remote in your Conda environment on tiger-arrk
    Run jf project generate cms in the terminal
    Replace the file ~/.jfremote/cms.yaml with the following. Edit any necessary fields (i.e., those in angle brackets, such as <NetID>). If you have not completed the instructions in  💽Databases , do that first. This YAML will create a project named cms with two workers named basic_python and basic_vasp, but you can always add other workers to this file for different kinds of calculations.
    The Slurm resource information can be found in the resources section and can be adjusted as needed (such as the time). All available keys for resources can be found in the  qtoolkit documentation  (see the ${} entries).
name: cms
workers:
  basic_vasp:
    type: local
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2025.12
      conda activate cms
      module load vasp/6.5.1
      export QUACC_VASP_PARALLEL_CMD="srun --nodes 1 --ntasks-per-node 112"
      export QUACC_WORKFLOW_ENGINE=jobflow
    timeout_execute: 60
    max_jobs: 50
    resources:
      nodes: 1
      ntasks_per_node: 112
      cpus_per_task: 1
      mem: 900G
      time: 04:00:00
      account: rosengroup
  basic_python:
    type: local
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/python
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2025.12
      conda activate cms
      export QUACC_WORKFLOW_ENGINE=jobflow
    timeout_execute: 60
    resources:
      nodes: 1
      ntasks_per_node: 1
      cpus_per_task: 1
      mem: 8G
      time: 04:00:00
      account: rosengroup
queue:
  store:
    type: MongoStore
    host: localhost
    database: <MongoDB Database Name>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_jobs
  flows_collection: jf_flows
  auxiliary_collection: jf_aux
  batches_collection: jf_batches
exec_config: {}
jobstore:
  docs_store:
    type: MongoStore
    database: <MongoDB Database Name>
    host: localhost
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_outputs
    Run jf project check --errors to confirm that everything is set up correctly.
    Launch the runner via jf runner start
    Confirm that everything truly works by running the following minimal example.
from jobflow_remote.utils.examples import add
from jobflow_remote import submit_flow
from jobflow import Flow

job1 = add(1, 2)
job2 = add(job1.output, 2)
flow = Flow([job1, job2])

ids = submit_flow(flow, worker="basic_python")
print(ids)
    If everything is working, you should see two job IDs via jf job list and one flow ID via jf flow list. Once the jobs make it through the Slurm queue, they should eventually reach a state of COMPLETED; keep re-running jf job list until they do.
    Note that we imported a pre-defined add function because Jobflow-Remote requires all job functions to be importable. See the sketch after these steps for how to use your own functions.
    Importantly, check out the results of your run in your MongoDB database. See  💽Databases  for details on how to access your MongoDB database.
    When you're done running workflows, you can terminate the daemon via jf runner stop so it's not just running idle.
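If you want to run your own functions rather than the pre-packaged add, put them in a module that is importable on both tiger-arrk and the worker (for example, an installed package or a file on your PYTHONPATH) and decorate them with jobflow's @job. The following is a minimal sketch; mymodule.py is a hypothetical file that you would need to make importable on both sides.
# mymodule.py (must be importable on tiger-arrk and on the worker)
from jobflow import job

@job
def multiply(a, b):
    """Multiply two numbers as a standalone jobflow job."""
    return a * b

# Submission script, run on tiger-arrk
from jobflow import Flow
from jobflow_remote import submit_flow
from mymodule import multiply

flow = Flow([multiply(2, 3)])
print(submit_flow(flow, worker="basic_python"))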
If you ever update the worker configuration, you will need to restart the active runner via jf runner stop followed by jf runner start for the changes to be picked up.
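Once the minimal example works, you can submit real calculations to the basic_vasp worker in the same way. Below is a rough sketch using quacc's VASP relaxation recipe; the relax_job import and its default settings are assumptions on our part, so adapt it to whatever recipes you actually use.
from ase.build import bulk
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import relax_job  # assumes quacc is installed in the cms environment

# QUACC_WORKFLOW_ENGINE=jobflow must also be set in the environment where this
# script runs so that relax_job returns a jobflow Job rather than running directly.
atoms = bulk("Cu")
flow = Flow([relax_job(atoms)])

# Route the flow to the VASP worker defined in ~/.jfremote/cms.yaml
ids = submit_flow(flow, worker="basic_vasp")
print(ids)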

Batch Submission

The "Regular Submission" approach submits one @job per Slurm job. For instance, if max_jobs: 50, then you would have at most 50 Slurm jobs in the queue and at most 50 @job-decorated functions running at a time.
Jobflow-Remote offers other queuing options as well. One is  batch mode . In batch mode, each Slurm job will continually pull in new work until the walltime is hit. This is convenient if your @jobs are quite short and you don't want to have to submit a new Slurm job for every @job.
To use batch mode, modify the worker in the YAML as follows. The presence of the batch: field is what tells Jobflow-Remote to use batch mode for that worker. In this setup, Jobflow-Remote will launch at most 50 Slurm jobs, and each one will keep running new @jobs until the walltime is hit.
Batch mode runs the risk of some @jobs timing out once the walltime is hit, which you will then have to rerun. Optionally, you can set the max_time: <int> field under the batch: field (alongside jobs_handle_dir and work_dir) to define the number of seconds after which no new @job will be started. As such, max_time should be a value less than the walltime.
name: cms
workers:
  <Worker Name>:
    max_jobs: 50
    resources:
      nodes: 1
      ntasks_per_node: 112
      cpus_per_task: 1
      mem: 900G
      time: 04:00:00
      account: rosengroup
    batch:
      jobs_handle_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp/jfr_handle_dir
      work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp/jfr_batch_jobs
  basic_python:
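Submitting work to a batch worker looks the same as before; you simply point submit_flow at the batch-enabled worker. As a sketch, reusing the add example from earlier (with <Worker Name> standing in for whatever you named the worker):
from jobflow import Flow
from jobflow_remote import submit_flow
from jobflow_remote.utils.examples import add

# Many short jobs: in batch mode, a handful of Slurm jobs will work through
# all of these instead of one Slurm job being submitted per @job.
jobs = [add(i, i + 1) for i in range(100)]
flow = Flow(jobs)

ids = submit_flow(flow, worker="<Worker Name>")
print(ids)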

Parallel Batch Mode

Finally, there is  parallel batch mode . Parallel batch mode is like batch mode but allows you to run multiple @jobs concurrently per Slurm job. This is particularly useful if you want to request a large number of nodes per Slurm job while keeping only a small number of Slurm jobs in the queue. This is generally not needed on Tiger but is useful on other machines. We will demonstrate it here anyway.
The YAML for a parallel batch job will look like the following. Note the addition of the parallel_jobs field under batch:, which tells Jobflow-Remote how many @jobs to run in parallel within a given Slurm job. Here, we have chosen parallel_jobs: 4 to indicate that there are 4 VASP jobs per Slurm job. Accordingly, we have set nodes: 4 so that each Slurm job requests 4 nodes and has enough resources to run 4 one-node VASP jobs. We also reduced max_jobs to 5 so that we do not have an enormous number of Slurm jobs in the queue.
name: cms
workers:
  <Worker Name>:
    max_jobs: 5
    resources:
      nodes: 4
      ntasks_per_node: 112
      cpus_per_task: 1
      mem: 900G
      time: 04:00:00
      account: rosengroup
    batch:
      jobs_handle_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp/jfr_handle_dir
      work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp/jfr_batch_jobs
      parallel_jobs: 4
  basic_python:
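The submission side is again unchanged. For example, if you queued up 20 one-node VASP jobs with this worker, Jobflow-Remote would keep at most 5 four-node Slurm jobs in the queue, each running 4 of the @jobs at a time. A minimal sketch (the quacc relax_job recipe and the choice of structures are illustrative assumptions):
from ase.build import bulk
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import relax_job  # assumes quacc is installed in the cms environment

# 20 one-node VASP relaxations; the parallel batch worker runs them 4 at a
# time inside each 4-node Slurm job, with at most 5 Slurm jobs queued.
structures = [bulk(metal) for metal in ("Cu", "Al", "Ni", "Pd", "Pt")] * 4
flow = Flow([relax_job(atoms) for atoms in structures])

print(submit_flow(flow, worker="<Worker Name>"))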

Della

You can use Jobflow-Remote to submit jobs on Della too, but everything must still be orchestrated from the tiger-arrk login node. In this case, you have to make a new worker and set the type to remote instead of local. The steps are outlined below:
    Make sure that you can ssh between tiger-arrk and Della without a password or 2FA. To do this, you need to set up SSH keys between tiger-arrk and Della. If you followed the  😴Removing Tedium  guide, then it's the same process except now it's between tiger-arrk and Della instead of between your local machine and the clusters.
    Install Jobflow-Remote on Della, and make sure that the versions of both Jobflow and Jobflow-Remote are the same on both machines.
    Modify your Jobflow-Remote YAML config file to add another worker that has the appropriate details for submitting a job on Della. For instance, add the following worker to your list of workers:
name: cms
workers:
  basic_della_ml:
    type: remote
    host: della.princeton.edu
    user: <NetID>
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/ml
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2025.12
      conda activate cms
      export QUACC_WORKFLOW_ENGINE=jobflow
      export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    timeout_execute: 60
    resources:
      gres: gpu:1
      nodes: 1
      ntasks_per_node: 20
      cpus_per_task: 1
      mem: 36G
      time: 04:00:00
      account: rosengroup
    Run jf runner restart to ensure the changes take effect.
    Run jf project check --errors to ensure there are no configuration errors.
    You can then submit jobs and flows to Della from tiger-arrk by using your newly created worker.
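As a quick check that the remote worker is wired up correctly, you can rerun the earlier minimal example but route it to the Della worker:
from jobflow import Flow
from jobflow_remote import submit_flow
from jobflow_remote.utils.examples import add

job1 = add(1, 2)
job2 = add(job1.output, 2)
flow = Flow([job1, job2])

# Same submission call as before; only the worker name changes. The runner on
# tiger-arrk handles the SSH connection to Della for you.
ids = submit_flow(flow, worker="basic_della_ml")
print(ids)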

Stellar

You can use Jobflow-Remote to submit jobs on Stellar too, but everything must still be orchestrated from the tiger-arrk login node. In this case, you have to make a new worker and set the type to remote instead of local. The steps are outlined below:
    Make sure that you can ssh between tiger-arrk and Stellar without a password or 2FA. To do this, you need to set up SSH keys between tiger-arrk and Stellar. If you followed the  😴Removing Tedium  guide, then it's the same process except now it's between tiger-arrk and Stellar instead of between your local machine and the clusters.
    Install Jobflow-Remote on Stellar, and make sure that the versions of both Jobflow and Jobflow-Remote are the same on both machines.
    Modify your Jobflow-Remote YAML config file to add another worker that has the appropriate details for submitting a job on Stellar. For instance, add the following worker to your list of workers:
name: cms
workers:
  basic_stellar_vasp:
    type: remote
    host: stellar.princeton.edu
    user: <NetID>
    scheduler_type: slurm
    work_dir: /scratch/gpfs/ROSENGROUP/<NetID>/jobflow/vasp
    max_jobs: 1
    pre_run: |
      source ~/.bashrc
      module load anaconda3/2025.12
      conda activate cms
      module load vasp/6.5.1
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 --ntasks-per-node 96"
      export QUACC_WORKFLOW_ENGINE=jobflow
    timeout_execute: 60
    resources:
      account: cbe
      time: 04:00:00
      nodes: 1
      ntasks_per_node: 96
      cpus_per_task: 1
      mem: 700G
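Submission again happens from tiger-arrk, just with the Stellar worker name. A minimal sketch (the quacc relax_job recipe is an assumption; substitute whatever recipes you actually run on Stellar):
from ase.build import bulk
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import relax_job  # assumes quacc is installed in the cms environment

# Built and submitted on tiger-arrk; the runner hands the work off to Stellar.
flow = Flow([relax_job(bulk("Cu"))])
print(submit_flow(flow, worker="basic_stellar_vasp"))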