Install and use TensorFlow on GPU nodes
The Cholesky cluster has two GPU nodes, each equipped with 4 NVIDIA Tesla V100 graphics cards.
The corresponding Slurm partition is the gpu partition.
You can install the TensorFlow environment you need with Anaconda.
Installation of TensorFlow
The installation must be done interactively from a GPU node, with the CUDA module loaded, so that the graphics cards are correctly detected.
Get an interactive shell on a GPU node via Slurm:
$ srun --nodes=1 --gres=gpu:1 --partition=gpu --time=01:30:00 --pty bash -i
Load the CUDA and Anaconda modules, create a dedicated Conda environment, then install TensorFlow:
$ module load anaconda3/2020.11 cuda/10.2
(base) $ conda create -n tf-gpu
(base) $ conda activate tf-gpu
(tf-gpu) $ conda install tensorflow-gpu
To install a specific version of TensorFlow:
(tf-gpu) $ conda search tensorflow-gpu
(tf-gpu) $ conda install tensorflow-gpu==2.1.0
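Still inside the interactive session, you can optionally verify that a CUDA-enabled build was installed and that the card is detected. This quick sanity check is not part of the original procedure, just a suggestion; run it with python from the tf-gpu environment:
import tensorflow as tf

# Should print True for a CUDA-enabled build, then list at least one GPU on a GPU node.
print(tf.test.is_built_with_cuda())
print(tf.config.list_physical_devices('GPU'))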
Then you can log out from the GPU node.
Use TensorFlow
We can run a Python script that uses TensorFlow to report the number of available GPUs (saved here as $HOME/python/tensorflow/gpu-available.py, the path used in the Slurm script below):
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Slurm batch script (tf-job.slurm):
#!/bin/sh
#SBATCH --job-name=gpu-job
#SBATCH --time=120 # max 120 minutes
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2 # number of GPUs to use
module load anaconda3/2020.11 cuda/10.2
conda activate tf-gpu
python3 $HOME/python/tensorflow/gpu-available.py
$ sbatch tf-job.slurm
Submitted batch job 1249
$ cat slurm-1249.out
[...]
pciBusID: 0000:1a:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-29 15:08:19.205278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:1c:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[...]
Num GPUs Available: 2
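The job above only checks that the GPUs are visible; to actually train on both of them, TensorFlow's tf.distribute.MirroredStrategy can be used. The following is a minimal sketch, where the model, layer sizes and random data are purely illustrative and not part of the cluster documentation:
import numpy as np
import tensorflow as tf

# Replicate the model across all GPUs allocated by Slurm (--gres=gpu:2 here).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Random data just to exercise both GPUs; each global batch is split across the replicas.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=128, epochs=1)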