KAUST GPU HACKATHON 2018
Programming GPUs with OpenACC
The introduction talk by Dr. Saber Feki can be downloaded here.
- Connection to IBEX:
ssh -X username@glogin.ibex.kaust.edu.sa
- Go to the following location:
cd /scratch/dragon/amd/$USER
mkdir gpuhackathon18
cd gpuhackathon18
From now on, we assume that you are always working inside the folder /scratch/dragon/amd/$USER/gpuhackathon18.
- Before you start working on IBEX, load the following modules:
module load cuda/9.0.176
module load pgi/17.10
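To verify that the environment is set up, you can check the compiler versions (pgcc -V is PGI's version flag):
pgcc -V
nvcc --version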
- Compilation
Example:
Serial Laplace:
If you plan to use the PGI compiler:
pgcc -ta=tesla:cc60 -O2 -Minfo=all -o laplace_serial src/laplace_serial.c
- -Minfo=all: prints all compiler feedback messages
Output:
main:
29, Loop not vectorized/parallelized: contains call
32, Generated an alternate version of the loop
Generated vector simd code for the loop
Generated 3 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 3 prefetch instructions for the loop
41, Generated vector simd code for the loop containing reductions
Generated 2 prefetch instructions for the loop
initialize:
68, Memory zero idiom, loop replaced by call to __c_mzero8
73, Generated vector simd code for the loop
Residual loop unrolled 2 times (completely unrolled)
78, Generated vector simd code for the loop
Residual loop unrolled 2 times (completely unrolled)
track_progress:
90, Loop not vectorized/parallelized: contains call
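For reference, the messages above correspond to loops of the following shape. This is a sketch of the standard Jacobi iteration used in this exercise, not the exact contents of src/laplace_serial.c; the grid size and names are illustrative, though they match the names that appear in the compiler output further below.
#include <math.h>
#define N 1000
double Temperature[N+2][N+2];
double Temperature_previous[N+2][N+2];

/* One Jacobi sweep: average the four neighbors, then track the
   largest change and copy the grid for the next iteration. */
double jacobi_sweep(void) {
    double worst_dt = 0.0;
    int i, j;
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
    return worst_dt;
}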
- Execution
Submission script:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name="test"
#SBATCH --gres=gpu:p100:1
#SBATCH --res=HACKATHON_TEAMX
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --exclusive
#SBATCH --err=JOB.%j.err
#SBATCH --output=JOB.%j.out
#--------------------------------------------#
module load cuda/9.0.176
module load pgi/17.10
srun -n 1 --hint=nomultithread ./laplace_serial
Submit:
sbatch submit_laplace_serial.sh
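Replace the X in the reservation name with your team number (1-6). After submitting, you can monitor the job and read its output once it finishes (the file names come from the --err and --output lines of the script):
squeue -u $USER
cat JOB.<jobid>.out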
Source: submission_laplace_serial
- Profiling
Compile your code for the CPU (remove the -acc and -ta flags from the PGI compilation if they were included).
Execute:
sbatch submit_profiling_terminal.sh
Source: submission_profile_terminal
Open the output file and see the profiling information.
To use a GUI:
- Use the submission file submission_profile_file
- Execute:
nvvp results.nvprof
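The exact contents of the profiling scripts are not reproduced here; as a rough sketch of what such a script might run (an assumption, not the actual script contents), the PGI profiler pgprof, an nvprof derivative shipped with the pgi module, covers both cases:
pgprof --cpu-profiling on ./laplace_serial
pgprof -o results.nvprof ./laplace_serial
The first prints a profile to the terminal; the second writes the results.nvprof file that nvvp opens.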
- Laplace version with initial OpenACC pragmas
pgcc -O2 -ta=tesla:cc60 -acc -Minfo=accel -o laplace_bad_acc src/laplace_bad_acc.c
Flags:
- -acc: enables OpenACC compilation
- -Minfo=accel: prints the accelerator-related compiler messages
Output:
main:
31, Generating implicit copyout(Temperature[1:1000][1:1000])
Generating implicit copyin(Temperature_previous[:][:])
32, Loop is parallelizable
33, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
32, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
33, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
41, Generating implicit copy(Temperature_previous[1:1000][1:1000])
Generating implicit copyin(Temperature[1:1000][1:1000])
42, Loop is parallelizable
43, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
42, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
44, Generating implicit reduction(max:worst_dt)
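The implicit copyin/copyout messages above explain why this version is "bad": with the pragmas placed directly on the loops and no enclosing data region, both arrays travel between host and device on every iteration of the solver. A sketch of that pragma placement, continuing the serial sketch above (illustrative; the actual src/laplace_bad_acc.c may differ in details):
while (worst_dt > TEMP_TOLERANCE) {
    /* Each kernels region triggers its own implicit host<->device copies. */
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);

    worst_dt = 0.0;
    /* Second region: max reduction plus the grid copy, transferred again. */
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
}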
You can modify the time limit and the number of tasks, if required.
Use the compiler pgf90 for Fortran and pgc++ for C++.
Options: -ta=tesla:cc60 -Minfo=all,intensity
- -Minfo=all: provides all the compiler information (including acceleration), for example "Loop not vectorized: loop count too small" or "Loop unrolled 6 times (completely unrolled)"
- -Minfo=intensity: reports the computational intensity of every loop, i.e. the ratio of compute operations to memory operations; if it is greater than or equal to 1.0, the loop is a good candidate for the GPU, otherwise it is not.
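As a worked example (our own arithmetic, not compiler output): for a loop body like a[i] = b[i]*c[i] + d[i], there are 2 compute operations (one multiply, one add) against 4 memory operations (three loads, one store), giving an intensity of 2/4 = 0.5, so that loop on its own would not be a good candidate.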
- Execution
Submission script:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --job-name="test"
#SBATCH --gres=gpu:p100:1
#SBATCH --res=HACKATHON_TEAMX
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --exclusive
#SBATCH --err=JOB.%j.err
#SBATCH --output=JOB.%j.out
#--------------------------------------------#
module load cuda/9.0.176
module load pgi/17.10
srun -n 1 --hint=nomultithread ./laplace_bad_acc
Submit:
sbatch submit_laplace_bad_acc.sh
Source: submission_laplace_bad_acc
Modify the X according to your team number (1-6). In the above example we request one NVIDIA P100 card; if you plan to use two cards, declare:
#SBATCH --gres=gpu:p100:2
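Note that requesting two cards only makes them visible to the job; the code must also select which device to use. With OpenACC this is done through the runtime API, for example one device per MPI rank (a sketch; rank is assumed to come from MPI_Comm_rank):
#include <openacc.h>
/* Bind this process to one of the visible GPUs. */
int ngpus = acc_get_num_devices(acc_device_nvidia);
acc_set_device_num(rank % ngpus, acc_device_nvidia);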
- Profiling
Adjust the name of the binary in all job scripts
Execute:
sbatch submit_profiling_terminal.sh
Source: submission_profile_terminal
Open the output file and see the profiling information.
To use a GUI:
- Use the submission file submission_profile_file
- Execute:
nvvp results.nvprof
- Latest version with optimized OpenACC pragmas
Repeat the previous instructions with the file laplace_final_acc.c.
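The usual improvement over the initial version is a structured data region around the entire iteration loop, so the arrays are copied once and stay resident on the GPU. A sketch of that pattern, again continuing the earlier sketches (the actual src/laplace_final_acc.c may differ in details):
#pragma acc data copy(Temperature_previous) create(Temperature)
while (worst_dt > TEMP_TOLERANCE) {
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);

    worst_dt = 0.0;
    #pragma acc kernels
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            /* The scalar max reduction still comes back each iteration,
               but the large arrays now stay on the device. */
            worst_dt = fmax(fabs(Temperature[i][j] - Temperature_previous[i][j]), worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
}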
Tips
Use PGI with MPI
module load pgi/17.10
module load pgi/17.10_openmpi
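With these modules loaded, the Open MPI wrapper compilers invoke PGI underneath, so the OpenACC flags pass straight through. An illustrative compile line (laplace_mpi.c is a hypothetical source file, not one of the provided examples):
mpicc -O2 -acc -ta=tesla:cc60 -Minfo=accel -o laplace_mpi src/laplace_mpi.c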
Material
- Ibex cheat sheet
- OpenACC web page
- CUDA
- CUDA with C/C++
- CUDA with Fortran
- Deep Neural Networks (cuDNN); load it with:
module load cudnn