KAUST GPU HACKATHON 2018
Programming GPUs with OpenACC
The introduction talk by Dr. Saber Feki can be downloaded here.
- Connection to IBEX:
ssh -X username@glogin.ibex.kaust.edu.sa
- Go to the following location:
cd /scratch/dragon/amd/$USER
mkdir gpuhackathon18
cd gpuhackathon18
From now on, we assume that you are always working inside the folder /scratch/dragon/amd/$USER/gpuhackathon18.
- Before you start working on IBEX, load the following modules:
module load cuda/9.0.176
module load pgi/17.10
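To verify that the environment is set up, you can check the compiler versions (pgcc -V is PGI's version flag):
pgcc -V
nvcc --version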
- Compilation
Example:
Serial Laplace:
If you plan to use the PGI compiler:
pgcc -ta=tesla:cc60 -O2 -Minfo=all -o laplace_serial src/laplace_serial.c
- -Minfo=all: prints all compiler feedback messages
Output:
main:
29, Loop not vectorized/parallelized: contains call
32, Generated an alternate version of the loop
Generated vector simd code for the loop
Generated 3 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 3 prefetch instructions for the loop
41, Generated vector simd code for the loop containing reductions
Generated 2 prefetch instructions for the loop
initialize:
68, Memory zero idiom, loop replaced by call to __c_mzero8
73, Generated vector simd code for the loop
Residual loop unrolled 2 times (completely unrolled)
78, Generated vector simd code for the loop
Residual loop unrolled 2 times (completely unrolled)
track_progress:
90, Loop not vectorized/parallelized: contains call
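For reference, the messages above correspond to loops of the following shape. This is a sketch of the standard Jacobi iteration used in this exercise, not the exact contents of src/laplace_serial.c; the grid size and names are illustrative, though they match the names that appear in the compiler output further below.
#include <math.h>
#define N 1000
double Temperature[N+2][N+2];
double Temperature_previous[N+2][N+2];

/* One Jacobi sweep: average the four neighbors, then track the
   largest change and copy the grid for the next iteration. */
double jacobi_sweep(void) {
    double worst_dt = 0.0;
    int i, j;
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
    return worst_dt;
}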
- Execution
Submission script:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name="test"
#SBATCH --gres=gpu:p100:1
#SBATCH --res=HACKATHON_TEAMX
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --exclusive
#SBATCH --err=JOB.%j.err
#SBATCH --output=JOB.%j.out
#--------------------------------------------#
module load cuda/9.0.176
module load pgi/17.10
srun -n 1 --hint=nomultithread ./laplace_serial
Submit:
sbatch submit_laplace_serial.sh
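Replace the X in the reservation name with your team number (1-6). After submitting, you can monitor the job and read its output once it finishes (the file names come from the --err and --output lines of the script):
squeue -u $USER
cat JOB.<jobid>.out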
Source: submission_laplace_serial
- Profiling
Compile your code for the CPU (remove the -acc and -ta flags from the PGI compilation if they were included).
Execute:
sbatch submit_profiling_terminal.sh
Source: submission_profile_terminal
Open the output file and see the profiling information.
To use a GUI:
- Use the submission file submission_profile_file
- Execute:
nvvp results.nvprof
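The exact contents of the profiling scripts are not reproduced here; as a rough sketch of what such a script might run (an assumption, not the actual script contents), the PGI profiler pgprof, an nvprof derivative shipped with the pgi module, covers both cases:
pgprof --cpu-profiling on ./laplace_serial
pgprof -o results.nvprof ./laplace_serial
The first prints a profile to the terminal; the second writes the results.nvprof file that nvvp opens.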
- Laplace version with initial OpenACC pragmas
pgcc -O2 -ta=tesla:cc60 -acc -Minfo=accel -o laplace_bad_acc src/laplace_bad_acc.c
Flags:
- -acc: enables OpenACC compilation
- -Minfo=accel: prints the accelerator-related compiler messages
Output:
main:
31, Generating implicit copyout(Temperature[1:1000][1:1000])
Generating implicit copyin(Temperature_previous[:][:])
32, Loop is parallelizable
33, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
32, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
33, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
41, Generating implicit copy(Temperature_previous[1:1000][1:1000])
Generating implicit copyin(Temperature[1:1000][1:1000])
42, Loop is parallelizable
43, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
42, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
44, Generating implicit reduction(max:worst_dt)
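The implicit copyin/copyout messages above explain why this version is "bad": with the pragmas placed directly on the loops and no enclosing data region, both arrays travel between host and device on every iteration of the solver. A sketch of that pragma placement, continuing the serial sketch above (illustrative; the actual src/laplace_bad_acc.c may differ in details):
while (worst_dt > TEMP_TOLERANCE) {
    /* Each kernels region triggers its own implicit host<->device copies. */
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);

    worst_dt = 0.0;
    /* Second region: max reduction plus the grid copy, transferred again. */
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
}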
You can modify the time limit and the number of tasks, if required.
Use the compiler pgf90 for Fortran and pgc++ for C++.
Options: -ta=tesla:cc60 -Minfo=all,intensity
- -Minfo=all: provides all the compiler information (including acceleration), for example "Loop not vectorized: loop count too small" or "Loop unrolled 6 times (completely unrolled)"
- -Minfo=intensity: reports the computational intensity of every loop, i.e. the ratio of compute operations to memory operations; if it is greater than or equal to 1.0, the loop is a good candidate for the GPU, otherwise it is not.
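As a worked example (our own arithmetic, not compiler output): for a loop body like a[i] = b[i]*c[i] + d[i], there are 2 compute operations (one multiply, one add) against 4 memory operations (three loads, one store), giving an intensity of 2/4 = 0.5, so that loop on its own would not be a good candidate.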
- Execution
Submission script:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name="test"
#SBATCH --gres=gpu:p100:1
#SBATCH --res=HACKATHON_TEAMX
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --exclusive
#SBATCH --err=JOB.%j.err
#SBATCH --output=JOB.%j.out
#--------------------------------------------#
module load cuda/9.0.176
module load pgi/17.10
srun -n 1 --hint=nomultithread ./laplace_bad_acc
Submit:
sbatch submit_laplace_bad_acc.sh
Source: submission_laplace_bad_acc
Modify the X according to your team number (1-6). In the above example we request one NVIDIA P100 card; if you plan to use two cards, declare:
#SBATCH --gres=gpu:p100:2
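Note that requesting two cards only makes them visible to the job; the code must also select which device to use. With OpenACC this is done through the runtime API, for example one device per MPI rank (a sketch; rank is assumed to come from MPI_Comm_rank):
#include <openacc.h>
/* Bind this process to one of the visible GPUs. */
int ngpus = acc_get_num_devices(acc_device_nvidia);
acc_set_device_num(rank % ngpus, acc_device_nvidia);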
- Profiling
Adjust the name of the binary in all job scripts
Execute:
sbatch submit_profiling_terminal.sh
Source: submission_profile_terminal
Open the output file and see the profiling information.
To use a GUI:
- Use the submission file submission_profile_file
- Execute:
nvvp results.nvprof
- Latest version with optimized OpenACC pragmas
Repeat the previous instructions with the file laplace_final_acc.c.
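The usual improvement over the initial version is a structured data region around the entire iteration loop, so the arrays are copied once and stay resident on the GPU. A sketch of that pattern, again continuing the earlier sketches (the actual src/laplace_final_acc.c may differ in details):
#pragma acc data copy(Temperature_previous) create(Temperature)
while (worst_dt > TEMP_TOLERANCE) {
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);

    worst_dt = 0.0;
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            /* The scalar max reduction still comes back each iteration,
               but the large arrays now stay on the device. */
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
}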
Tips
Use PGI with MPI
module load pgi/17.10
module load pgi/17.10_openmpi
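With these modules loaded, the Open MPI wrapper compilers invoke PGI underneath, so the OpenACC flags pass straight through. An illustrative compile line (laplace_mpi.c is a hypothetical source file, not one of the provided examples):
mpicc -O2 -acc -ta=tesla:cc60 -Minfo=accel -o laplace_mpi src/laplace_mpi.c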
Material
- Ibex cheat sheet
- OpenACC web page
- CUDA
- CUDA with C/C++
- CUDA with Fortran
- Deep Neural Networks (cuDNN); load it with:
module load cudnn