The SOC radiative transfer program was implemented as a combination of a Python main program and OpenCL kernels, using the PyOpenCL library. An initial conversion to the Julia language, using the KernelAbstractions.jl library, showed that the Julia option was at least 30% slower.
To understand this difference, we examined a smaller example that was reduced from the original SOC code to about one hundred lines, including one kernel and two other kernel-side routines. In addition to Python+OpenCL and Julia+KernelAbstractions, we also included in the comparison Julia with the OpenCL.jl library, using the same OpenCL kernel as with PyOpenCL.
Here is the Python main program:
#!/usr/bin/env python
import pyopencl as cl
from numpy import *
import time
import sys
t00 = time.time()
N = 128 # size of "density" grid
NRAY = 10000 # number of "rays"
USE_GPU = 1 # 0 = use CPU, 1 = use GPU
ATOMICS = 0 # 0 = no atomics, 1 = use atomic updates
if (len(sys.argv)>1):
    USE_GPU = int(sys.argv[1])
    ATOMICS = int(sys.argv[2])
if (USE_GPU>0):
    GLOBAL = 1024
    LOCAL = 32
    platform = cl.get_platforms()[0]                                  # test system "0" = NVidia
    device = [platform.get_devices(cl.device_type.GPU)[USE_GPU-1],]   # "0" = NVidia 5080 mobile GPU
else:
    GLOBAL = 32
    LOCAL = 4
    platform = cl.get_platforms()[1]                                  # test system "1" = POCL OpenCL
    device = [platform.get_devices(cl.device_type.CPU)[0],]           # "0" = CPU
print(device)
context = cl.Context(device)
queue = cl.CommandQueue(context)
mf = cl.mem_flags
DENS = ones(N*N*N, float32) # Cartesian grid of N^3 cells, input "density"
DENS_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=DENS)
TABS = zeros(N*N*N, float32) # Cartesian grid of N^3 cells, kernel output
TABS_buf = cl.Buffer(context, mf.READ_WRITE, 4*N*N*N)
cl.enqueue_copy(queue, TABS_buf, TABS)
cl.enqueue_copy(queue, DENS_buf, DENS)
ARGS = " -D N=%d -D NRAY=%d -D ATOMICS=%d -D NVIDIA=%d" % (N, NRAY, ATOMICS, USE_GPU>0)
# ARGS += " -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only" # additional optimisation
source = open("kernel_test5.c").read()
program = cl.Program(context, source).build(ARGS)
kernel_test5 = program.test5
t0 = time.time()
for ifreq in range(10):
    t1 = time.time()
    kernel_test5(queue, [GLOBAL,], [LOCAL,], DENS_buf, TABS_buf)
    cl.enqueue_copy(queue, TABS, TABS_buf)
    print(" %.4f seconds" % (time.time()-t1))   # timing per kernel call
queue.finish()
print("%.4f seconds total, result %.3e" % (time.time()-t00, sum(ravel(TABS)))) # total time
and here is the OpenCL kernel code:
#define EPS 1.0e-4f
// #define ATOMICS 1
#if ATOMICS>0
#if (NVIDIA>0)
float atomicAdd_g_f(__global float* p, float val) {
    float prev;
    asm volatile(
        "atom.global.add.f32 %0, [%1], %2;"
        : "=f"(prev)
        : "l"(p) , "f"(val)
        : "memory"
    );
    return prev;
}
#else
inline void atomicAdd_g_f(volatile __global float *addr, float val) {
    union {
        unsigned int u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
        expected.f32 = current.f32;
        next.f32 = expected.f32 + val;
        current.u32 = atomic_cmpxchg((volatile __global unsigned int *)addr,
                                     expected.u32, next.u32);
    } while (current.u32 != expected.u32);
}
#endif
#endif
int Index(const float x, const float y, const float z) {
    if ((x<=0.0f)||(y<=0.0f)||(z<=0.0f)||(x>=N)||(y>=N)||(z>=N)) {
        return -1 ;
    }
    return (int)(floor(x)+N*floor(y)+N*N*floor(z)) ;
}
void Step(float *x, float *y, float *z, const float u, const float v, const float w, float *ds, int *ind) {
    float s ;
    s   = (u>0.0f) ? ((1.0f+EPS-fmod(*x, 1.0f))/u) : ((-EPS-fmod(*x, 1.0f))/u) ;
    s   = min(s, (v>0.0f) ? ((1.0f+EPS-fmod(*y, 1.0f))/v) : ((-EPS-fmod(*y, 1.0f))/v)) ;
    *ds = min(s, (w>0.0f) ? ((1.0f+EPS-fmod(*z, 1.0f))/w) : ((-EPS-fmod(*z, 1.0f))/w)) ;
    *x += (*ds)*u ;
    *y += (*ds)*v ;
    *z += (*ds)*w ;
    *ind = Index(*x, *y, *z) ;
}
__kernel void test5(__global float *DENS, __global float *TABS) {
    const int id = get_global_id(0) ;
    float x = 1.0f, y = 1.0f, z = 1.0f, u, v, w, tmp, ds ;
    int ind, oind ;
    for(int iray=0; iray<NRAY ; iray++) {
        u = x ;  v = 0.1f ;  w = 0.5f ;
        tmp = sqrt(u*u+v*v+w*w) ;
        u /= tmp ;  v /= tmp ;  w /= tmp ;
        x = 39.5f ;  y = 39.5f ;  z = EPS ;
        ind = Index(x, y, z) ;
        while(ind>=0) {
            oind = ind ;
            Step(&x, &y, &z, u, v, w, &ds, &ind) ;
#if ATOMICS
            atomicAdd_g_f(&(TABS[oind]), ds) ;
#else
            TABS[oind] += exp(-ds) ;
#endif
        }
    }
}
The corresponding Julia main program, using the same C-language kernel, was:
#!/usr/bin/env julia
using Printf
using Statistics
using OpenCL
const N = Int32(128)
const NRAY = 10000
t00 = time()
USE_GPU = 1
ATOMICS = 1
if (length(ARGS)==2)
    USE_GPU = parse(Int32, ARGS[1])   # 0 = use CPU, 1 = use GPU
    ATOMICS = parse(Int32, ARGS[2])
end
if (USE_GPU>0)
    platform = cl.platforms()[1]       # on test system "1" = NVidia
    device = cl.devices(platform)[1]   # on test system "1" = NVidia 5080 mobile
    cl.device!(device)
    GLOBAL = 1024
    LOCAL = 32
else
    platform = cl.platforms()[3]       # on test system "3" = POCL OpenCL for CPU
    device = cl.devices(platform)[1]
    cl.device!(device)
    GLOBAL = 32
    LOCAL = 4
end
println(device)
fp = open("kernel_test5.c")
source = read(fp, String)
# P = cl.Program(; source) |> cl.build!
P0 = cl.Program(; source)
P = cl.build!(P0; options=@sprintf("-D ATOMICS=%d -D N=%d -D NRAY=%d -D NVIDIA=%d", ATOMICS, N, NRAY, USE_GPU))
K = cl.Kernel(P, "test5")
DENS = ones(Float32, N*N*N)
TABS = zeros(Float32, N*N*N)
D_DENS = CLArray(DENS)
D_TABS = CLArray(TABS)
function CL_test()
    global DENS, TABS, D_DENS, D_TABS, LOCAL, GLOBAL
    for ifreq=1:10
        t1 = time()
        clcall(K, Tuple{CLPtr{Float32}, CLPtr{Float32}}, D_DENS, D_TABS; global_size=GLOBAL, local_size=LOCAL)
        TABS = Array(D_TABS)
        @printf(" %.5f seconds\n", time()-t1)   # run time per kernel call
    end
end
CL_test()
@printf("%.4f seconds total -- result %.3e\n", time()-t00, sum(TABS)) # total run time
Finally, below is the entire Julia code using KernelAbstractions:
#!/usr/bin/env julia
using KernelAbstractions
const KA = KernelAbstractions
using Printf
# using Statistics
using Atomix
const N = Int32(128)
const EPS = 1.0f-4
const NRAY = Int32(10000)
t00 = time()
if (length(ARGS)==2)
    const USE_GPU = parse(Int32, ARGS[1])   # 0 = CPU, 1 = GPU
    const ATOMICS = parse(Int32, ARGS[2])   # 0 = no atomics, 1 = with atomic updates
else
    const USE_GPU = false
    const ATOMICS = false
end
if USE_GPU>0
    using CUDA
    using CUDA.CUDAKernels
    BACK = CUDABackend()
    CUDA.device!(0)        # on test system "0" = NVidia 5080 mobile
    println(CUDA.device())
    CUDA.allowscalar(false)
    const GLOBAL = Int64(1024)
    const LOCAL = Int64(32)
else
    BACK = CPU()           # on test system = POCL OpenCL on CPU
    const GLOBAL = Int32(32)
    const LOCAL = Int64(4)
end
const int0 = Int32(0)
const int1 = Int32(1)
@inline @inbounds function Index(x::Float32, y::Float32, z::Float32)::Int32
    if ((x<=0.0f0)|(x>=N)|(y<=0.0f0)|(y>=N)|(z<=0.0f0)|(z>=N))
        return int0
    end
    return Int32(1.0f0 + floor(floor(x)+N*floor(y)+N*N*floor(z)))
end
@inline @inbounds function Step(x::Float32, y::Float32, z::Float32, u::Float32, v::Float32, w::Float32)
    ds = min(
        (u>0.0f0) ? ((1.0f0+EPS-mod(x,1.0f0))/u) : ((-EPS-mod(x,1.0f0))/u),
        (v>0.0f0) ? ((1.0f0+EPS-mod(y,1.0f0))/v) : ((-EPS-mod(y,1.0f0))/v),
        (w>0.0f0) ? ((1.0f0+EPS-mod(z,1.0f0))/w) : ((-EPS-mod(z,1.0f0))/w)
    )
    x += ds*u
    y += ds*v
    z += ds*w
    ind = Index(x, y, z)::Int32
    return ds, ind, x, y, z
end
if (ATOMICS>0)
    @kernel inbounds=true unsafe_indices=true function kernel_test5(@Const(DENS), TABS)
        id = @index(Global, Linear)   # 3.6 sec
        id -= 1
        x = 1.0f0 ; y = 1.0f0 ; z = 1.0f0 ; u = 0.0f0 ; v = 0.0f0 ; w = 0.0f0 ; ds = 0.0f0
        for iray=1:NRAY
            u = x ; v = 0.1f0 ; w = 0.5f0 ;
            tmp = sqrt(u*u+v*v+w*w) ;
            u /= tmp ; v /= tmp ; w /= tmp ;
            x = 39.5f0 ; y = 39.5f0 ; z = EPS ;
            ind = Index(x, y, z)::Int32
            while(ind>=int1)
                oind = ind
                ds, ind, x, y, z = Step(x, y, z, u, v, w)
                Atomix.@atomic TABS[oind] += ds
            end
        end
    end
else
    @kernel inbounds=true unsafe_indices=true function kernel_test5(@Const(DENS), TABS)
        id = @index(Global, Linear)   # 3.6 sec
        id -= 1
        x = 1.0f0 ; y = 1.0f0 ; z = 1.0f0 ; u = 0.0f0 ; v = 0.0f0 ; w = 0.0f0 ; ds = 0.0f0
        for iray=1:NRAY
            u = x ; v = 0.1f0 ; w = 0.5f0 ;
            tmp = sqrt(u*u+v*v+w*w) ;
            u /= tmp ; v /= tmp ; w /= tmp ;
            x = 39.5f0 ; y = 39.5f0 ; z = EPS ;
            ind = Index(x, y, z)::Int32
            while(ind>=int1)
                oind = ind
                ds, ind, x, y, z = Step(x, y, z, u, v, w)
                TABS[oind] += exp(-ds)
            end
        end
    end
end
K_test5! = kernel_test5(BACK, LOCAL)
DENS = ones(Float32, N*N*N)
TABS = zeros(Float32, N*N*N)
D_DENS = KA.zeros(BACK, Float32, N*N*N)
D_TABS = KA.zeros(BACK, Float32, N*N*N)
KA.copyto!(BACK, D_DENS, DENS)
KA.copyto!(BACK, D_TABS, TABS)
function KA_test()
    t0 = time()
    for ifreq=1:10
        t1 = time()
        K_test5!(D_DENS, D_TABS, ndrange=(GLOBAL))
        KA.copyto!(BACK, TABS, D_TABS)
        # KA.synchronize(BACK)
        @printf(" %.4f seconds\n", time()-t1)   # run time per kernel call
    end
end
KA_test()
@printf("%.4f seconds total -- result %.3e\n", time()-t00, sum(TABS)) # total run time
Each of the above three programs (Python+OpenCL, Julia+OpenCL, Julia+KernelAbstractions) was run in four variants, depending on whether the computation uses the CPU or the GPU and whether the kernel updates are made with atomic operations or not. For correct results the updates should be atomic, since several work items may add to the same cell. However, because the atomic operations are implemented differently (in the OpenCL kernels above they are implemented inside the kernel, and differently for CPU and GPU), we also measured the run times without atomics for comparison.
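To illustrate why the non-atomic variant is incorrect when several parallel workers accumulate into the same cells, here is a minimal CPU-side Julia sketch. It is our own illustration, not part of the benchmark scripts; it contrasts plain updates with the Atomix-based updates used in the KernelAbstractions kernel. With more than one Julia thread (e.g. julia -t 8), the plain version typically loses part of the contributions.
using Base.Threads
using Atomix
# Many threads add 1.0f0 into the same cell. Without atomics the
# read-modify-write operations race and some updates are lost;
# Atomix.@atomic gives the exact sum.
function accumulate!(tabs::Vector{Float32}, n::Int; atomic::Bool)
    @threads for i in 1:n
        if atomic
            Atomix.@atomic tabs[1] += 1.0f0   # atomic update
        else
            tabs[1] += 1.0f0                  # racy update
        end
    end
    return tabs[1]
end
tabs = zeros(Float32, 1)
println("non-atomic: ", accumulate!(tabs, 1_000_000; atomic=false))  # usually < 1.0e6
tabs .= 0.0f0
println("atomic:     ", accumulate!(tabs, 1_000_000; atomic=true))   # exactly 1.0e6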
Each run includes the initialisations and ten calls to the kernel routine. The following figure shows the run times (time per kernel call), with the x-axis corresponding to the ten kernel calls of one run. The first iteration(s) are affected by initialisation overhead (code compilation). For a short run (e.g. one that included only a single kernel call) the initialisation would be a significant part of the total run time. Conversely, the run times of the later iterations are more representative of the relative performance in longer runs.

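If one wants a single number per run for comparing the frameworks, one option is to average only the later, compilation-free iterations. Below is a small Julia helper along these lines; it is only a sketch, and the name steady_mean and the skip parameter are our own, not part of the test scripts.
using Statistics
# Mean per-call time after discarding the first `skip` warm-up calls,
# which include kernel/JIT compilation overhead.
function steady_mean(times::AbstractVector{<:Real}; skip::Integer=1)
    @assert length(times) > skip
    return mean(times[(skip+1):end])
end
# Usage: collect the per-call times printed in the loops above into a
# vector `times` and call steady_mean(times).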
The Python and Julia OpenCL programs should have very similar timings, because they share the same kernel code and the main programs account for only a small fraction of the run time. This is indeed the case, both on the CPU and on the GPU. Only the first iteration takes longer in the Julia version (perhaps a previously compiled kernel was not cached?).
On the other hand, the performance of Julia + KernelAbstractions shows clear deviations. Without atomics it is the slowest option. The difference is some 30% on the GPU, but on the CPU it grows to more than a factor of two. In our previous tests with the full SOC program, we found a difference of some 30% on both CPUs and GPUs.
In the version with atomic updates, Julia + KernelAbstractions is still about 30% slower on the GPU than the OpenCL options, and the initialisation continues to take a very long time (more than twice the OpenCL run time of the entire run with ten kernel calls). On the other hand, the use of atomic operations appears to have an almost negligible effect on the CPU, where the KernelAbstractions version is now almost 50% faster than the OpenCL versions (if the slow initialisation is ignored...).
Based on the above, KernelAbstractions might be a good option for CPU calculations. However, as observed in the tests with the full SOC kernels, it can unfortunately be slower than OpenCL on GPUs (especially on NVidia). The 30% difference is large enough to make one hesitate to replace Python + OpenCL with Julia + KernelAbstractions. There may not be a fundamental reason for this performance difference, and one may hope that the situation will improve with future library updates. Alternatively, it is possible that our KernelAbstractions implementation above is somehow flawed, in which case somebody will hopefully let us know how the performance could be improved (e.g. via GitHub, see below).
The tests were run on a laptop with an Ultra 9 275HX CPU (8+16 cores) and an NVidia 5080 mobile GPU. The scripts can be found on GitHub, in the Julia_test subdirectory (test5.py for the Python version, test5_cl.jl for the Julia OpenCL version, and test5.jl for the Julia KernelAbstractions version).