Meta has released the Universal Model for Atoms (UMA), a large foundation machine-learned interatomic potential trained on a diverse range of materials. Installing and launching UMA can be a bit tricky at first, so we outline the steps here. Apply for model access at https://huggingface.co/facebook/UMA and sign the agreement. Make sure to enter your full university name (e.g., "Princeton University", not "Princeton"). Then add an access token to your Hugging Face account (Account > Access Tokens) and give the token read access to the facebook/UMA repository.
This is the screen you should see when adding an access token; only read access to facebook/UMA is needed.
On the machine where you plan to run UMA and in a fresh Python environment, run pip install "huggingface_hub[cli]" to install the Hugging Face CLI tool (the quotes prevent some shells, such as zsh, from expanding the brackets). Remember that UMA is a machine-learned potential, so it should be run on a machine with GPUs.
Optional but recommended: In your ~/.bashrc, set the HF_HOME environment variable to some location where storage limits are not much of a concern, like in a scratch directory. For instance, that might look like export HF_HOME=/scratch/gpfs/ROSENGROUP/<NetID>/huggingface. Then make sure to source ~/.bashrc so the changes are reflected.
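To confirm that the variable is picked up in a new shell, a quick standard-library check is enough (the fallback string reflects Hugging Face's default cache location):

```python
import os

# HF_HOME controls where Hugging Face caches downloaded models;
# if unset, it falls back to ~/.cache/huggingface
print(os.environ.get("HF_HOME", "not set (defaults to ~/.cache/huggingface)"))
```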
On the login node of the machine, run hf auth login and, when prompted, paste the access token you created earlier. Alternatively, you can define an environment variable named HF_TOKEN in your ~/.bashrc.
Install the fairchem code via pip install fairchem-core in the fresh environment that you made earlier.
Run a minimal calculation on the login node with your model of choice (example below). We are doing this because the first time you run UMA, it needs to download a few files, and the compute nodes do not have network access. If you wish to use a different model, simply modify the model_name variable accordingly.
from ase.build import bulk
from fairchem.core import pretrained_mlip, FAIRChemCalculator

# Minimal test system: bulk silicon
atoms = bulk("Si")

# The model weights are downloaded from Hugging Face on first use,
# which is why this must run on a node with network access
model_name = "uma-s-1p1"
predictor = pretrained_mlip.get_predict_unit(model_name)

# task_name="omat" selects the inorganic materials task
calc = FAIRChemCalculator(predictor, task_name="omat")
atoms.calc = calc

e = atoms.get_potential_energy()
print(e)
Assuming the above calculation runs, you are ready to run UMA on the compute nodes. The first time you run a calculation on the compute nodes, check that the model is using the GPUs appropriately and is not accidentally running on the CPU instead. If there are problems, it usually means that you need to re-install PyTorch with the appropriate configuration.
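A quick way to do that check, assuming PyTorch is importable in your environment (it is installed as a fairchem-core dependency), is:

```python
import torch

# UMA should report a CUDA device here; if this prints the
# "NOT available" branch, calculations will silently run on the CPU.
if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("CUDA NOT available; calculations would run on the CPU")
```

You can also watch nvidia-smi on the compute node while a calculation is running to confirm GPU utilization.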
The FAIR Chemistry team has already documented how to fine-tune UMA models. However, some parts of the instructions are specific to the FAIR cluster setup, so additional adjustments are necessary to follow them seamlessly on della-gpu. Here is some context that will help you understand this section:
The fairchem package runs training or fine-tuning jobs by automatically submitting a bash script to the SLURM scheduler through the submitit package. You do not need to know the internals of submitit, only that it generates the bash script and submits it on your behalf. While this is very convenient, it makes modifying the generated bash script somewhat cumbersome. When running training or fine-tuning, fairchem defaults to using Weights & Biases (W&B) for logging. This requires network access from the compute nodes, but compute nodes on della-gpu do not have network access by default (as mentioned above). Therefore, we need to add module load proxy/default to the generated bash script so that the compute nodes can reach external networks. Assuming you have already installed the fairchem package, first uninstall submitit so that we can reinstall it from a local copy:
pip uninstall submitit
Clone the submitit repository and move into it:
git clone https://github.com/facebookincubator/submitit.git
cd submitit
Edit submitit/slurm/slurm.py (relative to the repository root) and modify the block around line 530 as follows to hard-code loading the proxy module (the module load proxy/default line is the addition):
lines += [
    "",
    "# command",
    "export SUBMITIT_EXECUTOR=slurm",
    "module load proxy/default",
    command,
    "",
]
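If you prefer to script the edit instead of opening the file by hand, here is a standard-library sketch. It assumes the "export SUBMITIT_EXECUTOR=slurm", source line appears exactly once in slurm.py (true for the block shown above); the indentation of the inserted line is a guess and may need matching to the file:

```python
def add_proxy_line(source: str) -> str:
    """Insert 'module load proxy/default' after the SUBMITIT_EXECUTOR export."""
    needle = '"export SUBMITIT_EXECUTOR=slurm",'
    assert source.count(needle) == 1, "expected exactly one insertion point"
    return source.replace(
        needle, needle + '\n            "module load proxy/default",'
    )

# Demonstration on a stand-in for the block shown above:
sample = '''
        lines += [
            "",
            "# command",
            "export SUBMITIT_EXECUTOR=slurm",
            command,
            "",
        ]
'''
print(add_proxy_line(sample))
```

To apply it for real, read submitit/slurm/slurm.py with pathlib.Path.read_text, pass the text through this function, and write it back.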
Install the modified package locally in editable mode:
pip install -e .
When you follow the Generating Training/Fine-tuning Datasets section, you will eventually obtain uma_sm_finetune_template.yaml and data/uma_conserving_data_task_energy_force_stress.yaml in your specified --output-dir. These YAML files are used to configure fine-tuning.

uma_sm_finetune_template.yaml
You will see the job section begins around line 4. This section defines parameters used to generate the SLURM submission script as well as the W&B configuration. Below is an example setup compatible with della-gpu.
job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 1
    num_nodes: 1
    slurm:
      timeout_hr: 6
      cpus_per_task: 6
      mem_gb: 80
      additional_parameters:
        time: 0-12:00:00
        gres: gpu:1
        constraint: intel&gpu80
        mail_user: <NetID>@princeton.edu
        mail_type: begin,end,fail
  debug: false
  run_dir: finetune/
  run_name: <your_run_name>
  logger:
    _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
    _partial_: true
    entity: <your_wandb_team_name>
    project: <your_wandb_project_name>
Under job.scheduler.slurm, several required parameters must be defined: timeout_hr, cpus_per_task, and mem_gb. For cpus_per_task and mem_gb, I recommend using the values stated above. The value of timeout_hr can be set arbitrarily, since it will be overridden in the additional_parameters section.
The additional_parameters field allows you to specify extra SLURM directives (i.e., lines that appear after #SBATCH in the generated script). Use this section to customize the job configuration, such as wall time (time), GPU node type (constraint), and other SLURM options you typically use.
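With the additional_parameters shown above, and assuming submitit's usual conversion of underscores in option names to hyphens, the generated script header should contain directives roughly like the following (a sketch, not verbatim output):

```
#SBATCH --time=0-12:00:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=intel&gpu80
#SBATCH --mail-user=<NetID>@princeton.edu
#SBATCH --mail-type=begin,end,fail
```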
Set job.debug to false to enable W&B logging.
For job.run_dir, use a directory name of your choice. Logs and checkpoints for each run will be saved there.
job.run_name is used to label each run in your W&B project.
Fill in your team name and project name under job.logger.entity and job.logger.project, respectively. These should remain consistent across runs, especially when performing hyperparameter tuning, so that evaluation metrics appear in the same W&B project dashboard.
Under runner.train_eval_unit.model.checkpoint_location, you will see _target_ and model_name are predefined. By default, this configuration attempts to download the checkpoint file from its Hugging Face repository. However, even after successfully loading the proxy module in the previous section, this download does not work as expected. Therefore, please delete (or comment out) these two parameters. Instead, I recommend directly specifying the checkpoint_location parameter by providing the local path to the UMA model checkpoint you intend to fine-tune. In this case, since the base_model_name parameter will no longer be used anywhere in the configuration file, you may also delete (or comment out) its declaration.
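After the edit, the relevant part of uma_sm_finetune_template.yaml should look roughly like this (the exact nesting and the checkpoint filename below are placeholders; match them to your file):

```yaml
runner:
  train_eval_unit:
    model:
      # _target_ and model_name deleted (or commented out) so that no
      # Hugging Face download is attempted; point directly at a local file:
      checkpoint_location: /path/to/your/uma_checkpoint.pt
```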
data/uma_conserving_data_task_energy_force_stress.yaml
Change train_dataset.splits.train.src (around line 107) and val_dataset.splits.val.src (around line 117) to absolute paths. This allows you to run the command fairchem -c uma_sm_finetune_template.yaml from any directory.
In this file, you can also modify the loss function coefficients for the energy per atom, forces, and stresses. Look for the coefficient fields and adjust them as needed; the generated file provides the default values.