JFR Setup on Perlmutter

Overview

Here are instructions for setting up Jobflow-Remote on 🌌Perlmutter.

Setup

  • If you haven't already requested a MongoDB collection as described in 🌌Perlmutter, be sure to do that first and save your credentials somewhere.
  • After installing jobflow-remote, run jf project generate cms.
  • Edit ~/.jfremote/cms.yaml as described in the "Parallel Batch Setup" section below.
  • Run jf project check --errors to make sure everything looks okay.
  • Run jf runner start to start the daemon.
  • Test out a sample calculation, e.g. the minimal smoke test sketched below.
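For that last step, something along the lines of the jobflow-remote quickstart works well. The following is a minimal sketch: the add job is just a placeholder, and basic_vasp is the worker name used in the configuration below.

from jobflow import Flow, job
from jobflow_remote import submit_flow

# Trivial placeholder job: no VASP involved, just verifies that the
# runner, worker, and MongoDB stores are wired together correctly.
@job
def add(a, b):
    return a + b

flow = Flow([add(1, 5)])

# "basic_vasp" is the worker defined in ~/.jfremote/cms.yaml
submit_flow(flow, worker="basic_vasp")

You can then watch the job's progress with jf job list; it should reach the COMPLETED state once the runner and a batch allocation have picked it up.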

Daemon Details

The Jobflow-Remote daemon can be run on a login node without issue. You will just need to check occasionally, perhaps once a day or so, that it is still running, which you can do with jf runner status.
Note that Perlmutter assigns you to a random login node each time you SSH in, so if you log out and back in, you may land on a different login node than the one hosting the runner. To make life easier, write down the number of the login node where you started Jobflow-Remote; you can then SSH directly to it from any other login node, e.g. ssh login23 if the process is running on login node 23.

Parallel Batch Setup

Since Perlmutter has very long queue times, it is often best to use the parallel batch job submission mode. This lets you request one large Slurm allocation and run many jobs within it.
The example below submits a single five-node Slurm job (in the debug queue) that runs five concurrent one-node VASP simulations until the walltime is reached. Note that Perlmutter offers a 50% discount on Slurm jobs requesting at least 128 GPU nodes or at least 256 CPU nodes, but only scale up after testing things out. Make sure to replace all entries in <>.
Remember to change qos: debug to qos: regular (and increase the time) when you are done testing. Also note that nodes and parallel_jobs should be changed together when scaling up.
If you need a reservation flag, add qverbatim: "#SBATCH --reservation=MyReservationName" to the resources section, as shown in the short fragment after the full example below. The quotes are needed so the # is not interpreted as a YAML comment.
name: cms
workers:
  basic_vasp:
    type: local
    scheduler_type: slurm
    work_dir: </path/to/cfs/storage/jfr>
    pre_run: |
      source ~/.bashrc
      module load conda
      conda activate cms
      module load vasp/6.5.1_gpu
      export OMP_NUM_THREADS=8
      export OMP_PLACES=threads
      export OMP_PROC_BIND=spread
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L"
      export QUACC_WORKFLOW_ENGINE=jobflow
      export QUACC_CREATE_UNIQUE_DIR=False
    timeout_execute: 60
    max_jobs: 1
    resources:
      nodes: 5
      gres: gpu:4
      time: 00:30:00
      account: m5039_g
      qos: debug
      constraint: gpu
      mail_user: <your email address>
      mail_type: BEGIN,END,FAIL
    batch:
      jobs_handle_dir: </path/to/cfs/storage/jfr_handles>
      work_dir: </path/to/cfs/storage/jfr>
      parallel_jobs: 5
queue:
  store:
    type: MongoStore
    host: <mongodb07.nersc.gov>
    database: <MongoDB Database Name>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_jobs
  flows_collection: jf_flows
  auxiliary_collection: jf_aux
  batches_collection: jf_batches
exec_config: {}
jobstore:
  docs_store:
    type: MongoStore
    database: <MongoDB Database Name>
    host: <mongodb07.nersc.gov>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_outputs
  additional_stores:
    data:
      type: GridFSStore
      database: <MongoDB Database Name>
      host: <mongodb07.nersc.gov>
      username: <MongoDB UserName>
      password: <MongoDB PW>
      collection_name: jf_output_blobs
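For reference, the reservation flag mentioned above would sit alongside the other Slurm keywords in the resources block (MyReservationName is a placeholder):

resources:
  qos: debug
  qverbatim: "#SBATCH --reservation=MyReservationName"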

Atomate2 Details

If you are using Atomate2, also create the following ~/.atomate2.yaml file, which will have each VASP job run on one GPU node:
VASP_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_std
VASP_GAMMA_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_gam
CUSTODIAN_SCRATCH_DIR: /pscratch/path/to/run/custodian
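As a quick illustration (a sketch, not a definitive recipe: it assumes a POSCAR file in the current directory, and DoubleRelaxMaker is just one of many Atomate2 makers), an Atomate2 flow can then be submitted through Jobflow-Remote like any other:

from atomate2.vasp.flows.core import DoubleRelaxMaker
from jobflow_remote import submit_flow
from pymatgen.core import Structure

# Build a standard double-relaxation flow for a structure on disk
structure = Structure.from_file("POSCAR")
flow = DoubleRelaxMaker().make(structure)

# Submit to the batch worker defined in ~/.jfremote/cms.yaml
submit_flow(flow, worker="basic_vasp")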