This page describes how to set up Jobflow-Remote on 🌌Perlmutter.
Setup
If you haven't already requested a MongoDB collection as described in 🌌Perlmutter, be sure to do that first and save your credentials somewhere.
After installing jobflow-remote, run jf project generate cms.
Edit ~/.jfremote/cms.yaml as described in the "Parallel Batch Setup" section below.
Run jf project check --errors to make sure everything looks okay.
Run jf runner start to start the daemon.
Test out a sample calculation.
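The command-line steps above can be collected into one small shell function. This is only a sketch: the function name jf_setup is made up for illustration, the project name cms is taken from the steps above, and it assumes jf is already on your PATH. The config still has to be edited by hand between generating and checking the project, so the function stops after generating it.

```shell
# Hypothetical helper; nothing runs until you call jf_setup yourself.
jf_setup() {
    project="${1:-cms}"                        # project name used in the steps above
    config="$HOME/.jfremote/${project}.yaml"   # config path named in the steps above

    # First call: create the project config, then stop so it can be edited.
    if [ ! -f "$config" ]; then
        jf project generate "$project" || return 1
        echo "Edit $config, then run jf_setup again."
        return 0
    fi

    # Later calls: validate the edited config, then start the daemon.
    jf project check --errors || return 1
    jf runner start
}
```

Call jf_setup once to generate the config, edit the file, then call it again to check the project and start the runner.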
Daemon Details
The Jobflow-Remote daemon can be run on the login node without issue. You will just need to check occasionally that the daemon is still running, perhaps once a day or so.
Note that Perlmutter assigns you to a random login node each time you SSH in, so if you log out and back in, you might land on a different login node than the one where the runner is. To make life easier, write down the number of the login node where you started Jobflow-Remote so you can SSH back to it. For instance, if the process is running on login node 23, you can run ssh login23 from any of the other login nodes.
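One way to keep track of the login node is to write its name to a file when you start the runner. The following is a minimal sketch, assuming a hypothetical note file at ~/.jfremote/runner_host (Jobflow-Remote itself does not read or write it) and assuming the node name printed by uname -n is one that ssh accepts, such as login23. Both function names are made up for illustration.

```shell
# Hypothetical note file; Jobflow-Remote does not use it.
RUNNER_HOST_FILE="${RUNNER_HOST_FILE:-$HOME/.jfremote/runner_host}"

# Call this on the login node right after starting the runner.
record_runner_host() {
    mkdir -p "$(dirname "$RUNNER_HOST_FILE")" || return 1
    # uname -n prints this machine's node name.
    uname -n > "$RUNNER_HOST_FILE"
}

# Call this from any other login node to hop back to the recorded one.
goto_runner_host() {
    ssh "$(cat "$RUNNER_HOST_FILE")"
}
```

Once you are back on the right node, jf runner status should report whether the daemon is still running, assuming your Jobflow-Remote version provides that subcommand.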
Parallel Batch Setup
Since Perlmutter has very long queue times, it is often best to use the parallel batch job submission mode. This allows you to request a large Slurm allocation and run many jobs within it.
The example below will submit one 5-node Slurm job (in the debug queue), which will run 5 concurrent one-node VASP simulations until the wall-time is reached. Note that on Perlmutter, you get a 50% discount on Slurm jobs that request at least 128 GPU nodes or at least 256 CPU nodes, but only scale up after testing things out. Make sure to replace all entries in <>.
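As a rough illustration of the setup just described, the snippet below writes a sketch of a worker entry to a scratch file so you can compare it against ~/.jfremote/cms.yaml. Treat every detail as an assumption to check against the Jobflow-Remote documentation: the worker name basic_vasp, the scratch path, the wall-time, and the batch directory paths are placeholders, the key names are assumed from Jobflow-Remote's worker and batch options, and the VASP-specific pre_run commands are left out entirely.

```shell
# Scratch location for the sketch (hypothetical; nothing reads this file).
SKETCH="${TMPDIR:-/tmp}/cms_worker_sketch.yaml"

cat > "$SKETCH" <<'EOF'
# Sketch only: merge by hand into the workers section of ~/.jfremote/cms.yaml.
workers:
  basic_vasp:                # hypothetical worker name
    type: local              # assumes the runner and Slurm are on the same machine
    scheduler_type: slurm
    work_dir: <path/to/calculations>
    max_jobs: 1              # one Slurm allocation at a time
    resources:
      nodes: 5               # size of the Slurm allocation
      time: "00:30:00"       # placeholder wall-time for testing
      account: <account>
      qos: debug
      constraint: gpu        # assumes GPU nodes
    batch:
      jobs_handle_dir: <path/to/batch/handle_dir>
      work_dir: <path/to/batch/work_dir>
      parallel_jobs: 5       # concurrent one-node jobs inside the allocation
EOF

echo "Wrote $SKETCH"
```

Keeping nodes and parallel_jobs equal matches the one-node-per-job layout described above.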
Once you are done testing, remember to change qos: debug to qos: regular and to adjust the time accordingly. Also note that the nodes and parallel_jobs arguments should both be changed together when scaling up.
If you need a reservation flag, add qverbatim: "#SBATCH --reservation=MyReservationName" to the resources section. The quotes are required so that the # is not interpreted as the start of a YAML comment.