GPU Driver

During deployment of a Virtual Cluster, a GPU driver is installed automatically. This driver is tested during the release validation process. It is strongly recommended to stay with this driver version, but in some cases it can be useful or necessary to switch to another driver. This section describes how to replace the default GPU driver with a different version.

  1. Start the template instance of the partition for which the driver update is required. This can be done from the service instance or from the slurm instance (if the cluster has a dedicated instance for the Slurm controller) by running the following command as a user with admin privileges:
    vc-instance-manager instance-template start <PARTITION-NAME>
    
  2. Log in to the template instance:
    ./connect <TEMPLATE-NAME>
    
  3. Download the new driver installer from the NVIDIA download site to the template instance:
    wget https://us.download.nvidia.com/tesla/<DRIVER-VERSION>/NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run
    
    and replace <DRIVER-VERSION> with the version number of the driver, e.g. 470.57.02.
  4. Install the new driver (the old driver is automatically replaced):
    sudo chmod +x ./NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run
    sudo ./NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run --dkms -s
    
    and run nvidia-smi after the installation has completed to check that the driver works correctly.
  5. To apply the change to all new instances created for the partition, re-create the partition's instance template from the modified template instance. This can be done from the service instance or from the slurm instance (if the cluster has a dedicated instance for the Slurm controller) by running the following command as a user with admin privileges:
    vc-instance-manager instance-template update --partition <PARTITION-NAME>
    
    All new instances created for partition <PARTITION-NAME> will contain the upgraded GPU driver.
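Steps 3 and 4 can be collected into a few shell functions to run on the template instance. This is a minimal sketch, not part of the Virtual Cluster tooling: the function names installer_name, installer_url and install_driver are hypothetical, and the URL pattern and installer options are taken unchanged from the steps above.

```shell
# Sketch of steps 3 and 4; run on the template instance.
# All function names are illustrative and not part of any shipped tool.

# Print the installer file name for a given driver version.
installer_name() {
    printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Print the download URL, following the URL pattern shown in step 3.
installer_url() {
    printf 'https://us.download.nvidia.com/tesla/%s/%s\n' "$1" "$(installer_name "$1")"
}

# Download the installer, run it with the options from step 4,
# then call nvidia-smi to check that the driver responds.
install_driver() {
    wget "$(installer_url "$1")" || return 1
    chmod +x "./$(installer_name "$1")" || return 1
    sudo "./$(installer_name "$1")" --dkms -s || return 1
    nvidia-smi
}

# Example call, using the version number mentioned in step 3:
# install_driver 470.57.02
```

Each step stops the function on failure, so nvidia-smi only runs after the installer has completed successfully.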

Note

The original driver version deployed with the Virtual Cluster can be found in /var/drivers/gpu/nvidia/. To revert to the driver version that was originally deployed and tested for the Virtual Cluster, perform the steps above using the driver installer found in this directory.
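Locating the original installer for a revert could look like the sketch below. It assumes that the installer in /var/drivers/gpu/nvidia/ is stored as a .run file, which the text above does not state; the function name find_original_installer and the ORIGINAL_DRIVER_DIR variable are hypothetical.

```shell
# Directory named in the note above; can be overridden for testing.
ORIGINAL_DRIVER_DIR="${ORIGINAL_DRIVER_DIR:-/var/drivers/gpu/nvidia}"

# Print the first *.run file found directly in the given directory.
# Assumption: the original installer is stored there as a .run file.
find_original_installer() {
    find "$1" -maxdepth 1 -type f -name '*.run' | sort | head -n 1
}

# Example use on the template instance (commented out, not executed here):
# installer="$(find_original_installer "$ORIGINAL_DRIVER_DIR")"
# sudo sh "$installer" --dkms -s && nvidia-smi
```

If the directory holds several installers, check the printed path before running it, since the sketch simply picks the first name in sorted order.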