JFR Setup on Perlmutter

Overview

This page provides instructions for setting up Jobflow-Remote (JFR) on 🌌Perlmutter.

Setup

  • If you haven't already requested a MongoDB collection as described in 🌌Perlmutter, be sure to do that first and save your credentials somewhere.
  • Make sure your desired Conda environment is using custodian>=2025.10.11.
  • After installing jobflow-remote, run jf project generate cms.
  • Edit ~/.jfremote/cms.yaml as described in the "Parallel Batch Setup" section below.
  • Run jf project check --errors to make sure everything looks okay.
  • Run jf runner start to start the daemon.
  • Test out a sample calculation (a minimal example is sketched below).
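
For the last step, a minimal test could look like the following sketch. This is not part of the original instructions: it assumes quacc (with its VASP recipes) is installed in the cms environment, that QUACC_WORKFLOW_ENGINE=jobflow is set in the environment you submit from (not just in the worker's pre_run), and that the worker is named basic_vasp as in the configuration later on this page.

from ase.build import bulk
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import static_job

# Small test structure; replace with whatever you actually want to run.
atoms = bulk("Cu")

# With QUACC_WORKFLOW_ENGINE=jobflow, quacc recipes return jobflow Job objects.
job = static_job(atoms)

# Add the job to the JFR database; the runner daemon will submit it to the worker.
submit_flow(job, worker="basic_vasp")

After submitting, jf job list should show the job, and it will be picked up once the runner is started.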

Daemon Details

The Jobflow-Remote daemon can be run on a login node without issue. You will just need to check occasionally, perhaps once a day, that it is still running.
Note that Perlmutter randomly assigns you to a login node when you SSH in, so if you log out and back in, you may land on a different login node than the one where the runner is running. If you run jf runner start, Jobflow-Remote will tell you which login node the runner is on (if it is not the one you are currently on). You can then hop to that node from any other login node, e.g. ssh login23 if the runner is on login node 23.
Note: There is a bug with this, as discussed in  https://github.com/Matgenix/jobflow-remote/issues/373 . (Delete this note once the issue is resolved.) Until then, if you want to stop the runner, you will need to kill its processes manually (e.g. pkill -u <your username>).

Parallel Batch Setup

Since Perlmutter has very long queue times, it is often best to use the parallel batch job submission mode. This allows you to request a large Slurm allocation and run many jobs within it.
The example below will submit one 5-node Slurm job (in the debug queue), which will run five concurrent one-node VASP simulations until the walltime is reached. Note that on Perlmutter, Slurm jobs requesting at least 128 GPU nodes or at least 256 CPU nodes receive a 50% charging discount. Only scale up after testing things out. Make sure to replace all entries in <>.
Remember to change qos: debug to qos: regular (and increase the time) when you are done testing. Also note that the nodes and parallel_jobs values should both be changed when scaling up.
name: cms
workers:
  basic_vasp:
    type: local
    scheduler_type: slurm
    work_dir: </pscratch/path/to/my/workdir/jfr>
    pre_run: |
      source ~/.bashrc
      conda activate cms
      module load vasp/6.5.1_gpu
      export OMP_NUM_THREADS=8
      export OMP_PLACES=threads
      export OMP_PROC_BIND=spread
      export QUACC_VASP_PARALLEL_CMD="srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L"
      export QUACC_WORKFLOW_ENGINE=jobflow
      export QUACC_CREATE_UNIQUE_DIR=False
    timeout_execute: 60
    max_jobs: 1
    resources:
      nodes: 5
      gres: gpu:4
      time: 00:30:00
      account: m5039_g
      qos: debug
      constraint: gpu
      mail_user: <your email address>
      mail_type: BEGIN,END,FAIL
    batch:
      jobs_handle_dir: </pscratch/path/to/my/workdir/jfr_handles>
      work_dir: </pscratch/path/to/my/workdir/jfr>
      parallel_jobs: 5
queue:
  store:
    type: MongoStore
    host: <mongodb07.nersc.gov>
    database: <MongoDB Database Name>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_jobs
  flows_collection: jf_flows
  auxiliary_collection: jf_aux
exec_config: {}
jobstore:
  docs_store:
    type: MongoStore
    database: <MongoDB Database Name>
    host: <mongodb07.nersc.gov>
    username: <MongoDB UserName>
    password: <MongoDB PW>
    collection_name: jf_outputs
  additional_stores:
    data:
      type: GridFSStore
      database: <MongoDB Database Name>
      host: <mongodb07.nersc.gov>
      username: <MongoDB UserName>
      password: <MongoDB PW>
      collection_name: jf_output_blobs
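
To illustrate how this batch worker gets filled, here is a hedged sketch of submitting several one-node jobs at once: with parallel_jobs: 5 and each job's srun requesting one node, the single 5-node allocation works through them five at a time. The quacc relax_job recipe, the placeholder structures, and the basic_vasp worker name are assumptions taken from (or consistent with) the configuration above; adapt them to your own workflow.

from ase.build import bulk
from jobflow import Flow
from jobflow_remote import submit_flow
from quacc.recipes.vasp.core import relax_job

# Five independent one-node relaxations as placeholder work.
structures = [bulk(metal) for metal in ("Cu", "Al", "Ni", "Pd", "Pt")]
jobs = [relax_job(atoms) for atoms in structures]

# The batch worker runs these concurrently inside the single 5-node allocation,
# pulling new jobs from the handle directory as slots free up, until the walltime.
submit_flow(Flow(jobs), worker="basic_vasp")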

Atomate2 Details

If you are using Atomate2, also create the following ~/.atomate2.yaml file so that each VASP job runs on one GPU node:
VASP_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_std
VASP_GAMMA_CMD: srun -N 1 -n 4 -c 32 --exclusive --cpu_bind=cores -G 4 --gpu-bind=none stdbuf --output=L vasp_gam
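
As a rough sketch (not part of the original instructions), an Atomate2 workflow can then be submitted to the same worker. The StaticMaker choice and the POSCAR file below are purely illustrative assumptions; swap in whichever maker and structure you actually need.

from atomate2.vasp.jobs.core import StaticMaker
from jobflow_remote import submit_flow
from pymatgen.core import Structure

# Illustrative structure; replace with your own input.
structure = Structure.from_file("POSCAR")

# Atomate2 makers return jobflow Job/Flow objects that JFR can manage directly.
job = StaticMaker().make(structure)
submit_flow(job, worker="basic_vasp")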