Fine-tuning Procedures

Universal Model for Atoms
References
https://fair-chem.github.io/fine-tuning
Intro
The FAIR Chemistry team has already documented how to fine-tune UMA models. However, some parts of the instructions are specific to the FAIR cluster setup, so additional adjustments are necessary to follow them seamlessly on della-gpu.
Enabling External Network Access
Here is some context that will help you understand this section:
The fairchem package runs training or fine-tuning jobs by automatically submitting a bash script to the SLURM scheduler through the submitit package. You do not need to know the internal details of this package, only that it generates the bash script and submits it on our behalf. While this is very convenient, modifying the generated bash script can be somewhat cumbersome.
When running training or fine-tuning, the fairchem package defaults to using Weights & Biases (W&B) for logging. This requires network access from compute nodes, but compute nodes on della-gpu do not have network access by default (as also mentioned above).
Therefore, we need to add module load proxy/default to the generated bash script so that compute nodes can access external networks. To do this, assuming you have already installed the fairchem package, first uninstall submitit so we can reinstall it from a local copy:
pip uninstall submitit
Clone the submitit repository and move into it:
git clone https://github.com/facebookincubator/submitit.git
cd submitit
Edit slurm/slurm.py and insert the following line around line 530 to hard-code loading the proxy module:
    lines += [
        "",
        "# command",
        "export SUBMITIT_EXECUTOR=slurm",
        # Add this line to load the proxy module
        "module load proxy/default",
        # The input "command" is supposed to be a valid shell command
        command,
        "",
    ]
Install the modified package locally in editable mode:
pip install -e .
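As a quick sanity check after a job has been submitted, you can inspect the generated submission script and confirm that the proxy module is loaded before the command runs. The following is a minimal sketch. The sample script text and the command string are hypothetical, and where submitit writes the real script depends on your run directory.

```python
def proxy_loaded_before_command(script_text: str, command: str) -> bool:
    """Return True if 'module load proxy/default' appears before the command line."""
    lines = [line.strip() for line in script_text.splitlines()]
    proxy_line = "module load proxy/default"
    if proxy_line not in lines or command not in lines:
        return False
    return lines.index(proxy_line) < lines.index(command)


# Hypothetical excerpt of a generated submission script, for illustration only.
sample_script = """#!/bin/bash
#SBATCH --gres=gpu:1

# command
export SUBMITIT_EXECUTOR=slurm
module load proxy/default
srun python -m my_training_entrypoint
"""

print(proxy_loaded_before_command(sample_script, "srun python -m my_training_entrypoint"))
```

To check a real job, read the script from disk and pass its text and the command line to the same function.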
Modifying YAML files
When you follow the Generating Training/Fine-tuning Datasets section, you will eventually obtain uma_sm_finetune_template.yaml and data/uma_conserving_data_task_energy_force_stress.yaml in your specified --output-dir. These YAML files are used to configure fine-tuning.
uma_sm_finetune_template.yaml
You will see the job section begins around line 4. This section defines parameters used to generate the SLURM submission script as well as the W&B configuration. Below is an example setup compatible with della-gpu.
job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 1
    num_nodes: 1
    slurm:
      timeout_hr: 6
      cpus_per_task: 6
      mem_gb: 32
      additional_parameters:
        time: 0-12:00:00
        gres: gpu:1
        constraint: intel&gpu80
        mail_user: <NetID>@princeton.edu
        mail_type: begin,end,fail
  debug: false
  run_dir: <your_dir_name>/
  run_name: <your_run_name>
  logger:
    _target_: fairchem.core.common.logger.WandBSingletonLogger.init_wandb
    _partial_: true
    entity: <your_wandb_team_name>
    project: <your_wandb_project_name>
Under job.scheduler.slurm, three required parameters must be defined: timeout_hr, cpus_per_task, and mem_gb. For cpus_per_task and mem_gb, I recommend using the values shown above. The value of timeout_hr can be set arbitrarily, since it is overridden by the time entry in the additional_parameters section.
The additional_parameters field allows you to specify extra SLURM directives (i.e., lines that appear after #SBATCH in the generated script). Use this section to customize the job configuration, such as wall time (time), GPU node type (constraint), and other SLURM options you typically use.
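To make the mapping from additional_parameters to SLURM directives concrete, here is a minimal sketch of how such a dictionary could be rendered as #SBATCH lines. This is an approximation for illustration, not submitit's actual implementation. The function name render_sbatch_lines is hypothetical, and the underscore-to-hyphen conversion is an assumption about how keys such as mail_user become --mail-user.

```python
def render_sbatch_lines(additional_parameters: dict) -> list:
    """Render key/value pairs as '#SBATCH --key=value' lines (illustrative only)."""
    lines = []
    for key, value in additional_parameters.items():
        # Assumption: underscores in keys become hyphens in the SLURM flag name.
        flag = key.replace("_", "-")
        lines.append(f"#SBATCH --{flag}={value}")
    return lines


params = {
    "time": "0-12:00:00",
    "gres": "gpu:1",
    "constraint": "intel&gpu80",
    "mail_user": "<NetID>@princeton.edu",
    "mail_type": "begin,end,fail",
}

for line in render_sbatch_lines(params):
    print(line)
```

Comparing this output with the header of the generated submission script is a simple way to confirm that your settings were picked up.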
Set job.debug to false to enable W&B logging.
For job.run_dir, use a directory name of your choice. Logs and checkpoints for each run will be saved there.
job.run_name is used to label each run in your W&B project.
Fill in your team name and project name under job.logger.entity and job.logger.project, respectively. These should remain consistent across runs, especially when performing hyperparameter tuning, so that evaluation metrics appear in the same W&B project dashboard.
Under runner.train_eval_unit.model.checkpoint_location, you will see _target_ and model_name are predefined. By default, this configuration attempts to download the checkpoint file from its Hugging Face repository. However, even after successfully loading the proxy module in the previous section, this download does not work as expected. Therefore, please delete (or comment out) these two parameters. Instead, I recommend directly specifying the checkpoint_location parameter by providing the local path to the UMA model checkpoint you intend to fine-tune. In this case, since the base_model_name parameter will no longer be used anywhere in the configuration file, you may also delete (or comment out) its declaration.
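Before submitting, it may help to confirm that no angle-bracket placeholders from the example job section (such as <NetID> or <your_run_name>) remain in the template. The following is a minimal sketch. The helper name find_unfilled_placeholders is hypothetical, and the pattern assumes placeholders consist only of letters and underscores between angle brackets.

```python
import re

# Matches placeholders such as <NetID> or <your_run_name>.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")


def find_unfilled_placeholders(yaml_text: str) -> list:
    """Return the sorted, de-duplicated placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(yaml_text)))


# Small excerpt for illustration; in practice, read the template file from disk.
sample = (
    "run_dir: <your_dir_name>/\n"
    "run_name: my_run\n"
    "entity: <your_wandb_team_name>\n"
)

print(find_unfilled_placeholders(sample))
```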
data/uma_conserving_data_task_energy_force_stress.yaml
Change train_dataset.splits.train.src (around line 107) and val_dataset.splits.val.src (around line 117) to absolute paths. This allows you to run the command fairchem -c uma_sm_finetune_template.yaml from any directory.
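If you prefer to script this change, the path rewrite can be sketched as below. The function operates on an already-loaded configuration dictionary (for example, one obtained from a YAML parser of your choice). The helper name absolutize_src is hypothetical, and the sketch assumes each src entry is a single path string.

```python
from pathlib import Path


def absolutize_src(config: dict, base_dir: str) -> dict:
    """Make the train/val src paths absolute, relative to base_dir.

    The config is modified in place and also returned for convenience.
    base_dir is expected to be an absolute path.
    """
    for dataset_key, split_key in (("train_dataset", "train"), ("val_dataset", "val")):
        split = config[dataset_key]["splits"][split_key]
        src = Path(split["src"])
        if not src.is_absolute():
            split["src"] = str(Path(base_dir) / src)
    return config


# Hypothetical paths, for illustration only.
config = {
    "train_dataset": {"splits": {"train": {"src": "data/train"}}},
    "val_dataset": {"splits": {"val": {"src": "/already/absolute/val"}}},
}
absolutize_src(config, "/scratch/my_output_dir")
print(config["train_dataset"]["splits"]["train"]["src"])
print(config["val_dataset"]["splits"]["val"]["src"])
```

Paths that are already absolute are left unchanged, so the function is safe to run more than once.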
In this file, you can also modify the loss function coefficients for energy per atom, forces, and stresses. Look for the coefficient fields and adjust them as needed. The default ratio is 20:2:1.
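To build intuition for what the coefficients do, the sketch below combines three per-task loss values into one number using the default 20:2:1 ratio. It assumes the total loss is a plain weighted sum of the energy, force, and stress terms. This is a simplification for illustration and may not match how fairchem combines the terms internally.

```python
def weighted_loss(energy_loss, force_loss, stress_loss, coefficients=(20.0, 2.0, 1.0)):
    """Combine per-task losses as a weighted sum (illustrative assumption)."""
    c_energy, c_force, c_stress = coefficients
    return c_energy * energy_loss + c_force * force_loss + c_stress * stress_loss


# With equal per-task losses, the energy term dominates under the default ratio.
print(weighted_loss(1.0, 1.0, 1.0))

# Raising the force coefficient shifts the emphasis toward forces.
print(weighted_loss(1.0, 1.0, 1.0, coefficients=(20.0, 10.0, 1.0)))
```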
UPET
References
https://docs.metatensor.org/metatrain/latest/concepts/fine-tuning.html
https://docs.metatensor.org/metatrain/latest/generated_examples/0-beginner/02-fine-tuning.html
https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft.html
[optional] https://lab-cosmo.github.io/upet/latest/fine-tuning.html
Intro
Fine-tuning various PET models is also well documented in the references listed above. I recommend reading them in the order shown to get a better sense of how to set everything up. The fourth reference is included only for completeness; the information there is quite minimal, so it can be skipped.
The key to fine-tuning PET models is customizing the settings in options.yaml and passing it to mtt train options.yaml through the metatrain package. The first reference introduces the basic workflow, while the second provides more detailed guidance on how to configure options.yaml. The third reference covers fine-tuning in the most detail, but it may feel overwhelming without first reading the first two, so I recommend following that order.
There are two things I want to discuss in this section:
How to prepare the data
How to continue fine-tuning in the same W&B run when a job finishes or is interrupted
How to prepare the data
This is also well documented here. Here are a few additional tips:
ase.trajectory also works as an input format
In addition to ase.trajectory, any trajectory format that can be read with ase.io.read is supported, not just the XYZ format mentioned in the documentation
If your dataset contains more than 1 million data points, you may want to skip the DiskDataset section and move directly to MemmapDataset
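If you would rather hand the training tools a single extended XYZ file, any trajectory readable by ase.io.read can be converted with a few lines. The following is a minimal sketch. The helper names and the example file name are hypothetical, and ASE is imported lazily so that the path helper works even where ASE is not installed.

```python
from pathlib import Path


def default_extxyz_path(src_path: str) -> str:
    """Derive an output path by swapping the file suffix for '.xyz'."""
    return str(Path(src_path).with_suffix(".xyz"))


def convert_to_extxyz(src_path: str, dst_path: str = None) -> str:
    """Read all frames with ASE and write them as one extended XYZ file."""
    # Imported here so the helper above can be used without ASE installed.
    from ase.io import read, write

    frames = read(src_path, index=":")  # index=":" returns every frame as a list
    dst_path = dst_path or default_extxyz_path(src_path)
    write(dst_path, frames, format="extxyz")
    return dst_path


# Hypothetical file name, for illustration only.
print(default_extxyz_path("runs/md.traj"))
```

Calling convert_to_extxyz("runs/md.traj") would then write the converted file next to the original trajectory.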
How to continue fine-tuning in the same W&B run
You may want to continue fine-tuning within the same W&B run as a previous job, or a job may be stopped by the scheduler (or for other reasons) and you want to resume it while keeping the training progress in a single run. In that case, simply updating architecture.training.finetune.read_from in options.yaml is not enough to continue the learning curve in the same W&B run.
architecture:
  training:
    finetune:
      method: "full"
      read_from: /path/to/checkpoint.ckpt
This is because, by design, each training/fine-tuning job starts from epoch 0. As a result, the existing learning curve in W&B will be overwritten by the new run. To properly resume training from the desired epoch, some modifications to the source code are needed so that the metatrain package loads the epoch information from the checkpoint.
You need to modify 2 files:
metatrain/src/metatrain/cli/train.py
In line 671, replace:
trainer = Trainer(hypers["training"])
with:
if 'huggingface' in restart_from:
    trainer = Trainer(hypers["training"])
else:
    trainer = trainer_from_checkpoint(
        checkpoint=checkpoint,
        hypers=hypers["training"],
        context=training_context)
This is currently hard-coded in my setup, so if you find a cleaner solution, please let me know. The purpose of this modification is to allow the training process to start from epoch 0 when loading a checkpoint from Hugging Face (i.e., when starting a new fine-tuning run), while resuming from the saved epoch when loading from an existing fine-tuning checkpoint.
metatrain/src/metatrain/pet/trainer.py
In line 752, replace:
trainer.epoch = None
with:
trainer.epoch = checkpoint["epoch"]
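Taken together, the two patches implement a simple rule: start fresh when the checkpoint comes from Hugging Face, otherwise pick up the epoch stored in the checkpoint. The standalone sketch below illustrates only that rule. It does not use metatrain itself, the function name resolve_saved_epoch is hypothetical, and the example paths are made up.

```python
def resolve_saved_epoch(restart_from: str, checkpoint: dict):
    """Return None for a fresh fine-tuning run, else the epoch saved in the checkpoint."""
    if "huggingface" in restart_from:
        # New fine-tuning run from a pretrained model: no epoch to restore.
        return None
    # Resuming an existing fine-tuning run: reuse the saved epoch.
    return checkpoint["epoch"]


# Hypothetical checkpoint contents and paths, for illustration only.
checkpoint = {"epoch": 40}
print(resolve_saved_epoch("huggingface/pretrained-model.ckpt", checkpoint))
print(resolve_saved_epoch("/scratch/my_run/model.ckpt", checkpoint))
```

Note that the string check has the same limitation as the patch it mirrors: a local path that happens to contain "huggingface" would be treated as a fresh start.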
Finally, do not forget to update the wandb section in options.yaml as follows:
wandb:
  entity: your_group_name
  project: your_project_name
  name: your_run_name
  resume: must
  id: your_run_id
This allows W&B to recognize the resumed training as the same run instead of creating a new one.
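The difference between a fresh run and a resumed run comes down to two extra keys. The sketch below builds the settings for both cases. The helper name build_wandb_settings is hypothetical. The keys mirror the options.yaml section above, and to my knowledge they correspond to keyword arguments of wandb.init, but the sketch does not call W&B.

```python
def build_wandb_settings(entity, project, name, run_id=None):
    """Build W&B settings; add resume/id only when continuing an existing run."""
    settings = {"entity": entity, "project": project, "name": name}
    if run_id is not None:
        # resume="must" asks W&B to continue the run with this id
        # rather than create a new one.
        settings["resume"] = "must"
        settings["id"] = run_id
    return settings


# Hypothetical values, for illustration only.
print(build_wandb_settings("your_group_name", "your_project_name", "your_run_name"))
print(build_wandb_settings("your_group_name", "your_project_name", "your_run_name", run_id="your_run_id"))
```

The run id is the identifier W&B assigned to the original run. It is usually visible in the run's URL or overview page.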