GPU Driver
During the deployment of a Virtual Cluster, a GPU driver is installed automatically. This driver is tested during the release validation process. It is strongly recommended to stay with this driver version, but in some cases it can be useful or necessary to switch to another driver. This section describes how to replace the default GPU driver with a different version.
- Start a template instance for the partition for which the driver update is required. This can be done from the service instance or from the slurm instance (if the cluster has a dedicated instance for the Slurm controller) by running the following command as a user with admin privileges:
vc-instance-manager instance-template start <PARTITION-NAME>
- Log in to the template instance:
./connect <TEMPLATE-NAME>
- Download the new driver installer to the instance from the NVIDIA download site:
wget https://us.download.nvidia.com/tesla/<DRIVER-VERSION>/NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run
Replace <DRIVER-VERSION> with the version number of the driver, e.g. 470.57.02.
- Install the new driver (the old driver is automatically replaced):
sudo chmod +x ./NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run
sudo ./NVIDIA-Linux-x86_64-<DRIVER-VERSION>.run --dkms -s
After the installation has completed, run nvidia-smi to check that the driver works correctly.
- To apply the change to all new instances created for the partition, the instance template for the partition needs to be re-created from the modified template instance. This can be done from the service instance or from the slurm instance (if the cluster has a dedicated instance for the Slurm controller) by running the following command as a user with admin privileges:
vc-instance-manager instance-template update --partition <PARTITION-NAME>
All new instances created for partition “<PARTITION-NAME>” will contain the upgraded GPU driver.
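The download, install, and verification steps that run on the template instance can be collected into a small shell helper. The following is only a sketch, not part of the product: the function names (installer_name, installer_url, is_valid_version, update_driver) are hypothetical, and it assumes the driver is published under the same URL pattern as in the wget command above. The installer options are the ones used in the manual steps.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers that mirror the manual download and
# install steps. Run on the template instance, not on the service instance.

# Print the installer file name for a given driver version.
installer_name() {
    printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Print the download URL, assuming the URL pattern from the wget step.
installer_url() {
    printf 'https://us.download.nvidia.com/tesla/%s/%s\n' "$1" "$(installer_name "$1")"
}

# Accept only dotted numeric versions such as 470.57.02.
is_valid_version() {
    case "$1" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;  # empty, stray characters, misplaced dots
        *.*) return 0 ;;                       # at least one dot between digits
        *) return 1 ;;                         # no dot at all
    esac
}

# Download the installer, run it, then check the result with nvidia-smi.
update_driver() {
    version="$1"
    if ! is_valid_version "$version"; then
        echo "invalid driver version: $version" >&2
        return 1
    fi
    installer="$(installer_name "$version")"
    wget "$(installer_url "$version")" || return 1
    sudo chmod +x "./$installer" || return 1
    sudo "./$installer" --dkms -s || return 1
    nvidia-smi
}
```

After sourcing the file on the template instance, a call such as `update_driver 470.57.02` would run the three steps in order and stop at the first one that fails. The instance template still has to be updated afterwards from the service or slurm instance as described above.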
Note
The original driver version which was deployed with the Virtual Cluster
can be found in /var/drivers/gpu/nvidia/. If you need to revert to the
driver version which was originally deployed and tested for the Virtual
Cluster, you can perform the steps above using the driver installer found
in this directory.