# HOG — histogram of oriented gradients

##### HOG

Soler et al. (2018) have recently proposed the correlation of gradient directions as a useful tool to compare astronomical images. This has some similarity to template matching and RHT analysis in that analysis is based on local quantities that are estimated at each pixel position. This time that local quantity is the local gradient, and especially the gradient direction, estimated from the pixel values in a small environment around each pixel position. Routines for HOG analysis can be found in the GitHub repository of Juan Soler.

We wrote a version, just for the comparison of two images, using Python and OpenCL. We are using two kernels. The first one does the combined smoothing and gradient estimation. This is a single operation that corresponds to the convolution of the image with derivatives of the convolving Gaussian beam. This is done in real space, extending the averaging of pixel values up to a distance of 5 standard deviations. In addition to gradients, HOG also includes calculation of some circular statistics. In the original Python version this relies on another Python library, namely pycircstat. The statistics are based on the summation of some quantities over the maps. We wrote this using another kernel where each work item calculates partial sums. These are then transferred to host (the main Python program) that sums these up.

We include here the listings of the kernel codes used in this test. The first one is for the gradient calculation. The kernel receives two images and returns table of the relative angle in the gradient directions of the images.

```__kernel void HOG_gradients(__global float *I1,    // First image [NY, NX]
__global float *I2,    // Second image [NY, NX]
__global float *phi,   // angle for the gradient direction
) {
// Calculate gradients for the images I1 and I2, return array for the relative angle phi.
// Elements where the gradient is below given gradient thresholds are set to NaN.
int   id  =  get_global_id(0) ;
int   gs  =  get_global_size(0) ;
if   (id>=(N*M)) return ;                 // image dimensions [N,M]
int   d   = (int)(5.0f*SIGMA) ;           // maximum distance to which convolution extends
int   ind, i0=id % M,  j0 = id / M ;      // indices of the current pixel
float dx, dy ;
float b   =  1.0f/(2.0f*SIGMA*SIGMA) ;    // scale in exponent
float GX1 =  0.0f, GY1=0.0f, wei=0.0f, w ;
float GX2 =  0.0f, GY2=0.0f ;
const float NaN = nan((uint)0) ;
for(int j=j0-d; j<=j0+d; j++) {           // loop over nearby pixels
for(int i=i0-d; i<=i0+d; i++) {        //  -"-
ind     =  clamp(j, 0, N-1)*M+clamp(i, 0, M-1) ; // implements "nearest"
dx      =  i-i0 ;
dy      =  j-j0 ;
w       =  exp(-b*(dx*dx + dy*dy)) ;
wei    +=  w ;
GX1    +=  I1[ind] * ( - dx * w ) ;   // local gradient in the first image
GY1    +=  I1[ind] * ( - dy * w ) ;
GX2    +=  I2[ind] * ( - dx * w ) ;   // local gradient in the second image
GY2    +=  I2[ind] * ( - dy * w ) ;
}
}
dx       =   wei*SIGMA*SIGMA ;
GX1     /=   dx ;
GY1     /=   dx ;
GX2     /=   dx ;
GY2     /=   dx ;
w        =  atan2(GX1*GY2-GY1*GX2, GX1*GX2+GY1*GY2) ;  // phi !!
w        =  atan(tan(w)) ;
dx       =  sqrt(GX1*GX1+GY1*GY1) ;
dy       =  sqrt(GX2*GX2+GY2*GY2) ;
phi[id] =  ((dx<=gthresh1)||(dy<=gthresh2)) ?  NaN :  w ;
}
```

The second kernel is even more simple, calculating a number of sums over the pixels.

```__kernel void HOG_sums(__global float *phi,
__global float *scos,
__global float *ssin,
__global float *scos2,
__global float *ssin2,
__global float *weights,
__global float *wscos,
__global float *wssin,
__global float *wscos2,
__global float *wssin2
) {
int id = get_global_id(0) ;
int gs = get_global_size(0) ;
// Compute in kernel partial sums of cos, sin, cos^2, sin^2. Host calculates the final sums.
float sc=0.0f,ss=0.0f,sc2=0.0f,ss2=0.0f,wsc=0.0f,wss=0.0f,wsc2=0.0f,wss2=0.0f, p, w, s, c ;
for(int i=id; i<(N*M); i+=gs) {
p = phi[i] ;  w = weights[i] ; s = sin(p) ; c = cos(p) ;
if (isfinite(p)) {   // NaN pixels contribute nothing to the sums
sc    +=  c ;
ss    +=  s ;
sc2   +=  c*c ;
ss2   +=  s*s ;
wsc   +=  w * c ;
wss   +=  w * s ;
wsc2  +=  w * c*c ;
wss2  +=  w * s*s ;
}
}
scos[id]    =  sc ;     // kernel computes these partial sums
ssin[id]    =  ss ;     // ... host calculates the final total sums
scos2[id]   =  sc2 ;
ssin2[id]   =  ss2 ;
wscos[id]   =  wsc ;
wssin[id]   =  wss ;
wscos2[id]  =  wsc2 ;
wssin2[id]  =  wss2 ;
}
```

We tested this on a pair of images that had an original size of 512 × 512 pixels but which in the test were scaled to sizes between 256×256 and 8192 × 8192 pixels. The following picture compares the run times, with CPU and GPU, to the original Python code. The x-axis indicates the scaling of the image dimensions.

Because of the overheads of kernel compilations and data transfers, the OpenCL versions are for the 256×256 image slower than the original pure Python program. Above 512×512 pixels they overtake the original Python program and for sizes above 1024×1024 pixels GPU becomes faster than the CPU.  By the time we reach the 8196×8196 image size, we have probably converged to the final ratio between the run times: CPU version is ~3 times and the GPU version ~20 times faster than the pure Python program.

For CPU the result is actually abysmal. By using 6 CPU cores, instead of the one employed by the original Python programme, we have gained a mere factor of three in the run time. We might have expected more than a factor of six, especially when moving from interpreted Python to compile kernel programs. At the largest image size, CPU is using some 4 seconds on the first kernel and some 1 second on the second kernel that is called twice during each run. The poor performance of the convolution/gradient kernel might be because of the decision to do the convolution in real space. Although the convolving Gaussian was here small (σ ~ one pixel), each gradient value still  requires the summation over almost 100 pixels. Perhaps fast fourier transformations would be faster… Also, because the main calculation is the convolution operation, even in the case of the original Python program this is done in a compiled library routine and Python does not introduce significant overhead. With the larger number of computing elements, GPU manages to run the first OpenCL kernel in 0.2 seconds instead of the 4 seconds it took for the CPU. Call to the second kernel takes with GPU some 0.13 seconds instead of the ~1.0 seconds for the CPU. However, just by looking at the second kernel, one can see that it is not going to be extremely fast. There is too little computation compared to the amount of data that needs to be shifted between the host and the device.

In summary, this exercise was not entirely successful. This OpenCL implementation is not going to be very useful if one is analysing only small maps. On the other hand, with 10k images and larger, the ×20 speedup of the GPU version starts to be significant.  With a top-of-the-range desktop GPU, this speed-up could be further at least doubled.

The test was run on a laptop with a six core  i7-8700K CPU running at 3.70 GHz and with an Nvidia GTX-1080 GPU.