JFR Setup on Perlmutter

Overview

This page provides instructions for setting up Jobflow-Remote (JFR) on 🌌Perlmutter.

Setup

  • If you haven't already requested a MongoDB collection as described in 🌌Perlmutter, be sure to do that first and save your credentials somewhere.
  • Make sure your desired Conda environment is using custodian>=2025.10.11.
  • After installing jobflow-remote, run jf project generate cms.
  • Edit ~/.jfremote/cms.yaml as described in the "Parallel Batch Setup" section below.
  • Run jf project check --errors to make sure everything looks okay.
  • Run jf runner start to start the daemon.
  • Test out a sample calculation (a minimal example is sketched below).
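
For the last step, a minimal test could look like the following sketch. This is not part of the original instructions: it assumes quacc (with its VASP recipes) is installed in the cms environment, that QUACC_WORKFLOW_ENGINE=jobflow is set in the environment you submit from (not just in the worker's pre_run), and that the worker is named basic_vasp as in the configuration later on this page.

from ase.build import bulk
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import static_job

# Small test structure; replace with whatever you actually want to run.
atoms = bulk("Cu")

# With QUACC_WORKFLOW_ENGINE=jobflow, quacc recipes return jobflow Job objects.
job = static_job(atoms)

# Add the job to the JFR database; the runner daemon will submit it to the worker.
submit_flow(job, worker="basic_vasp")

After submitting, jf job list should show the job, and it will be picked up once the runner is started.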

Daemon Details

The Jobflow-Remote daemon can be run on a login node without issue. You will just need to check occasionally, perhaps once a day, that it is still running.
Note that Perlmutter randomly assigns you to a login node when you SSH in, so if you log out and back in, you may land on a different login node than the one where the runner is running. If you run jf runner start, Jobflow-Remote will tell you which login node the runner is on (if it is not the one you are currently on). You can then hop to that node from any other login node, e.g. ssh login23 if the runner is on login node 23.
Note: There is a bug with this, as discussed in  https://github.com/Matgenix/jobflow-remote/issues/373 . (Delete this note once the issue is resolved.) Until then, if you want to stop the runner, you will need to kill its processes manually (e.g. pkill -u <your username>).

Parallel Batch Setup

Since Perlmutter has very long queue times, it is often best to use the parallel batch job submission mode. This allows you to request a large Slurm allocation and run many jobs within it.
The example below will submit one 5-node Slurm job (in the debug queue), which will run five concurrent one-node VASP simulations until the walltime is reached. Note that on Perlmutter, Slurm jobs requesting at least 128 GPU nodes or at least 256 CPU nodes receive a 50% charging discount. Only scale up after testing things out. Make sure to replace all entries in <>.
Remember to change qos: debug to qos: regular (and increase the time) when you are done testing. Also note that the nodes and parallel_jobs values should both be changed when scaling up.
name: cms
workers:
  basic_vasp:
    type: local
    scheduler_type: slurm
    work_dir: </pscratch/path/to/my/workdir/jfr>
    pre_run: |
      source ~/.bashrc
      conda activate cms
      module load vasp/6.5.1_gpu
      export OMP_NUM_THREADS=8
      export OMP_PLACES=threads
      export OMP_PROC_BIND=spread
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L"
      export QUACC_WORKFLOW_ENGINE=jobflow
      export QUACC_CREATE_UNIQUE_DIR=False
    timeout_execute: 60
    max_jobs: 1
    resources:
      nodes: 5
      gres: gpu:4
      time: 00:30:00
      account: m5039_g
      qos: debug
      constraint: gpu
      mail_user: <your email address>
      mail_type: BEGIN,END,FAIL
    batch:
      jobs_handle_dir: </pscratch/path/to/my/workdir/jfr_handles>
      work_dir: </pscratch/path/to/my/workdir/jfr>
      parallel_jobs: 5
queue:
  store:
    type: MongoStore
    host: <mongodb07.nersc.gov>
    database: <MongoDB Database Name>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_jobs
  flows_collection: jf_flows
  auxiliary_collection: jf_aux
exec_config: {}
jobstore:
  docs_store:
    type: MongoStore
    database: <MongoDB Database Name>
    host: <mongodb07.nersc.gov>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_outputs
  additional_stores:
    data:
      type: GridFSStore
      database: <MongoDB Database Name>
      host: <mongodb07.nersc.gov>
      username: <MongoDB UserName>
      password: <MongoDB PW>
      collection_name: jf_output_blobs
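
To illustrate how this batch worker gets filled, here is a hedged sketch of submitting several one-node jobs at once: with parallel_jobs: 5 and each job's srun requesting one node, the single 5-node allocation works through them five at a time. The quacc relax_job recipe, the placeholder structures, and the basic_vasp worker name are assumptions taken from (or consistent with) the configuration above; adapt them to your own workflow.

from ase.build import bulk
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import relax_job

# Five independent one-node relaxations as placeholder work.
structures = [bulk(metal) for metal in ("Cu", "Al", "Ni", "Pd", "Pt")]
jobs = [relax_job(atoms) for atoms in structures]

# The batch worker runs these concurrently inside the single 5-node allocation,
# pulling new jobs from the handle directory as slots free up, until the walltime.
submit_flow(Flow(jobs), worker="basic_vasp")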

Atomate2 Details

If you are using Atomate2, also create the following ~/.atomate2.yaml file so that each VASP job runs on one GPU node:
VASP_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_std
VASP_GAMMA_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_gam
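
As a rough sketch (not part of the original instructions), an Atomate2 workflow can then be submitted to the same worker. The StaticMaker choice and the POSCAR file below are purely illustrative assumptions; swap in whichever maker and structure you actually need.

from atomate2.vasp.jobs.core import StaticMaker
from jobflow_remote import submit_flow
from pymatgen.core import Structure

# Illustrative structure; replace with your own input.
structure = Structure.from_file("POSCAR")

# Atomate2 makers return jobflow Job/Flow objects that JFR can manage directly.
job = StaticMaker().make(structure)
submit_flow(job, worker="basic_vasp")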