RHT — Rolling Hough Transform

RHT is another image analysis method used to locate filamentary structures. In some respects it is a special case of general template matching, where the template is (in the simplest version) a line segment. As in our template matching code, the analysis starts by subtracting from the original image a filtered image (usually calculated using a top-hat filter), and the result is again compared to a “template”, a description of a linear structure. However, there are some fundamental differences:

(1) After subtraction of the filtered data, the image is thresholded at zero: pixels with negative values are replaced with zeros and pixels with positive values with ones. In template matching this quantisation of the input data was not done.

(2) At each pixel position, and for a number of position angles, one effectively calculates the match between a “template” and the data. However, because the data now consist only of ones and zeros, the match (the number of positive pixels falling on a line) does not depend on the original pixel intensities.
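These two steps can be illustrated with a few lines of NumPy (a toy sketch only, not the OpenCL code below; it uses a separable square box filter instead of a circular top hat, and checks a single pixel and a single position angle):

```python
import numpy as np

rng   = np.random.default_rng(1)
image = rng.normal(0.0, 1.0, (64, 64)).astype(np.float32)
image[30, 10:54] += 5.0                    # embed a horizontal "filament"

# (1) subtract a smoothed image and threshold at zero -> bitmap of ones and zeros
k  = np.ones(15) / 15.0                    # box filter (square, not circular)
sm = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 1, image)
sm = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 0, sm)
bitmap = (image - sm > 0.0).astype(np.float32)

# (2) count ON pixels on a line segment through one pixel, for one position angle
j0, i0, RW = 30, 32, 10                    # centre pixel and half-length of the segment
theta = np.pi / 2                          # the angle matching the horizontal filament
l  = np.arange(-RW, RW + 1)
ii = np.floor(i0 + 0.5 - l * np.sin(theta)).astype(int)
jj = np.floor(j0 + 0.5 + l * np.cos(theta)).astype(int)
match = int((bitmap[jj, ii] > 0.0).sum()) # independent of the original intensities
```

Because the bitmap contains only zeros and ones, `match` counts ON pixels along the segment; here it is close to the full 21 samples, since the segment lies on the filament.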

A description of RHT applied to astronomical image analysis can be found in Clark, Peek, & Putman (2014), and a Python implementation of the routine is available on GitHub (seclark/RHT).

The analysis of large images can take some time, so we were interested in testing how easy it would be to make a parallel version using OpenCL. First, because of the limited extent of the filter function, the convolution can be done efficiently in real space (instead of using Fourier methods). Second, the actual RHT calculation consists of similar operations applied at the location of each map pixel, for a large number of potential position angles. This makes the problem (just like general template matching) very suitable for parallelisation and for GPU computing.

Below is a minimal OpenCL version that reads a FITS image, does the RHT analysis, and saves the outcome to a FITS file. The results consist of the RHT intensity R for each pixel (integrated over angle) and an estimate of the position angle for each pixel, both stored as 2D images. The third header unit of the FITS file contains a table of coordinates for the pixels where R was above a predetermined threshold. The table also includes, for each pixel, a histogram of R as a function of the position angle. Using a couple of standard Python libraries and the pyopencl library, the whole code takes only 191 lines! The program includes two kernels, one for the filtering and one for the actual RHT calculations. In this example the kernel code is included as a string inside the main Python script.

# -*- coding: utf-8 -*-
# Copyright (c) 2018, Mika Juvela, (subject to the FreeBSD License)
import sys, time
import numpy as np
from   astropy.io import fits as pyfits
import pyopencl as cl

kernel_source = """
__kernel void Smooth(__global float *S,      // input image [N*M]
                     __global float *SS) {   // filtered image
   // Pre-filtering of an input image before RHT calculation with top-hat function.
   int  id  = get_global_id(0) ;  // one output pixel per work item
   if (id>=(N*M)) return ;  // image S has dimensions N x M (row-major, S[i+M*j])
   int i0 = id % M ;        // column   0 <= i < M
   int j0 = id / M ;        // row      0 <= j < N
   int RK = (DK-1)/2 ;      // DK should be an odd number !
   int R2 = RK*RK ;
   float sum=0.0f, wsum=0.0f, r2 ;
   for(int i=max(0, i0-RK); i<min(i0+RK+1, M); i++) {       // column
      for(int j=max(0, j0-RK); j<min(j0+RK+1, N); j++) {    // row
         r2 = (i-i0)*(i-i0)+(j-j0)*(j-j0) ;
         if (r2<=R2) {
            if (S[i+M*j]!=0.0f) { // 0.0 denotes here a missing value!
               wsum += 1.0f ;
               sum  += S[i+M*j] ;
            }
         }
      }
   }
   if (wsum>1.0e-10f) {
      if ((S[id]-sum/wsum)>0.0f) SS[id] = 1.0f ;  // ... thresholded at zero
      else                       SS[id] = 0.0f ;
   } else {
      SS[id] = 0.0f ;
   }
}
__kernel void R_kernel_oneline(__global float *SS,        // thresholded image SS[N,M]
                               __global float *RBP,       // backprojection R[N,M]
                               __global float *THETA,     // average angle THETA[N,M]
                               __global float *THETAS) {  // theta histograms [j,i,itheta]
   // Basic version of RHT -- one steps along a single line, in steps of 1.0 pixels
   int id = get_global_id(0) ;
   if (id>=(N*M)) return ;
   int i0 = id % M ;      // column index in the image --   0 <= i < M
   int j0 = id / M ;      // row index
   int RW = (DW-1)/2 ;    // RW = radius of the circular window; DW should be an odd number
   if ((i0<(1+RW))||(i0>=(M-1-RW))||(j0<(1+RW))||(j0>=(N-1-RW))) {  // ignore image borders
      RBP[id] = 0.0f ; THETA[id] = 0.0f ;  return ;
   }
   int   val, sum=0 ;
   float Q=0.0f, U=0.0f, si, co, theta ;
   __global float *thetas = &THETAS[id*NDIR] ; // histogram R[j,i,theta] for current pixel
   for(int k=0; k<NDIR; k++) {               // loop over the directions
      theta =  k*M_PI_F/NDIR ;
      si    =  sin(theta) ;  co = cos(theta) ;
      val   =  0 ;
      for(int l=-RW; l<=+RW; l++) { // loop along the line, DW=2*RW+1 points
         int i = (int)floor(i0+0.5f-l*si) ; // reference is the centre of the pixel
         int j = (int)floor(j0+0.5f+l*co) ;
         if (SS[i+M*j]>0.0f) val++ ;        // val == number of ON pixels along the line
      }
      val  =  (val<THRESHOLD) ? 0 : (val-THRESHOLD) ; // thresholded number of ON pixels
      sum += val ;
      co = cos(2.0f*theta) ; // re-use si and co variables
      si = sin(2.0f*theta) ;
      Q += val*co ; // integrals of R*cos(2*theta) and R*sin(2*theta) over angle
      U += val*si ;
      thetas[k] = val/(float)DW ; // single bin in the histogram
   }
   RBP[id]   = sum * (M_PI_F/(NDIR*DW)) ; // straight sum over angles (~backprojection)
   THETA[id] = 0.5f*atan2(U, Q) ;         // [radians], RHT angle expectation value
}
"""

def InitCL(GPU=False):
    """
    Initialise OpenCL.
    Usage: platform, device, context, queue, mf = InitCL(GPU)
        GPU     = if True, try to find a GPU instead of CPU
    Return:
        platform, device, context, queue, mf (= cl.mem_flags)
    """
    platform, device, context, queue = None, None, None, None
    for iplatform in range(3):  # try up to three different OpenCL platforms (if they exist)
        try:
            platform = cl.get_platforms()[iplatform]
            device   = platform.get_devices([cl.device_type.CPU, cl.device_type.GPU][GPU>0])
            context  = cl.Context(device)
            queue    = cl.CommandQueue(context)
            break
        except Exception:
            pass
    mf = cl.mem_flags
    return platform, device, context, queue, mf
def RollingHoughTransform(F, DK, DW, FRAC, GPU=0):
    """
    Calculate the Rolling Hough Transform for image F.
    Usage:  R, THETA, J, I, THETAS  =  RollingHoughTransform(F, DK, DW, FRAC, GPU=0)
    Input:
        F         =  pyfits image to be analysed
        DK        =  size of convolution kernel = diameter [pixels]
        DW        =  diameter of the circular region [pixels]
        FRAC      =  fraction of ON pixels used as the threshold
        GPU       =  if >0, use GPU instead of CPU
    Return:
        R[j,i], THETA[j,i], J, I, and the histograms THETAS[ipix, itheta]
        for the selected pixels
    """
    # Initialise OpenCL environment and compile the kernel program.
    t0         =  time.time()
    platform, device, context, queue, mf = InitCL(GPU)
    LOCAL      =  [ 8, 64 ][GPU>0] # GPU uses more local work items..
    N, M       =  F[0].data.shape  # image dimensions
    THRESHOLD  =  int(FRAC*DW)     # "Z*DW"
    NDIR       =  int(np.floor((3.14159/np.sqrt(2.0))*(DW-1.0)+1.0)) # number of directions
    OPT        =  " -D DK=%d -D DW=%d -D N=%d -D M=%d" % (DK, DW, N, M)
    OPT       +=  " -D NDIR=%d  -D THRESHOLD=%d -D FRAC=%.4ef" % (NDIR, THRESHOLD, FRAC)
    program    =  cl.Program(context, kernel_source).build(OPT)
    print('--- initialisation  %5.2f seconds' % (time.time()-t0))
    # Define buffers for data transfer between the host and the device
    S          =  np.asarray(np.ravel(F[0].data), np.float32)   # original image
    SS         =  np.empty_like(S)                              # will be the filtered image
    S_buf      =  cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=S)
    SS_buf     =  cl.Buffer(context, mf.READ_WRITE, S.nbytes)
    RBP_buf    =  cl.Buffer(context, mf.WRITE_ONLY, S.nbytes)   # R[j,i] ("backprojection")
    THETA_buf  =  cl.Buffer(context, mf.WRITE_ONLY, S.nbytes)   # average theta angle 
    THETAS_buf =  cl.Buffer(context, mf.WRITE_ONLY, 4*NDIR*N*M) # histograms R[j,i,itheta]
    # One kernel for the filtering of the map
    SMOO      =  program.Smooth
    SMOO.set_scalar_arg_dtypes([None, None])
    t0        =  time.time()
    GLOBAL    =  (int((N*M)/64)+1)*64               # total number of work items
    SMOO(queue, [GLOBAL,], [LOCAL,], S_buf, SS_buf) # SS_buf will contain the filtered image
    cl.enqueue_copy(queue, SS, SS_buf)              # filtered image copied to SS
    print('--- smoothing       %5.2f seconds' % (time.time()-t0))
    # Another kernel for the RHT calculation
    t0        =  time.time()
    RHT       =  program.R_kernel_oneline
    RHT.set_scalar_arg_dtypes([None, None, None, None])
    RHT(queue, [GLOBAL,], [LOCAL,], SS_buf, RBP_buf, THETA_buf, THETAS_buf)
    RBP       =  np.zeros((N,M), np.float32)
    THETA     =  np.zeros((N,M), np.float32)
    THETAS    =  np.zeros((N*M, NDIR), np.float32)
    cl.enqueue_copy(queue, RBP,     RBP_buf)        # copy results from device to host
    cl.enqueue_copy(queue, THETA,   THETA_buf)
    cl.enqueue_copy(queue, THETAS,  THETAS_buf)
    print('--- RHT             %5.2f seconds' % (time.time()-t0))
    m         =  np.nonzero(RBP>0.0)                # selected pixels
    J, I      =  m[0], m[1]
    THETAS    =  THETAS[np.nonzero(np.ravel(RBP)>0.0)[0],:]
    return  RBP, THETA, J, I, THETAS

if (len(sys.argv)<2):
    print("\nUsage:   rht_cl.py  filename [options]")
    print("  options:")
    print("     -w  #       window diameter in pixels (default 55)")
    print("     -s  #       smoothing radius in pixels (default 15)")
    print("     -t  #       threshold (default 0.7)")
    print("     -g  #       0 or 1, whether to use GPU (default 0)")
    print("  output:")
    print("     FITS file result.fits")
    print("        F[1] = image containing R backprojection")
    print("        F[2] = image containing angles ")
    print("        F[3] = { i, j, { R[j,i,theta] } for pixels with R above threshold\n")
    sys.exit ()

tx, ARG  = time.time(), dict()
for i in range(len(sys.argv)//2-1):   ARG.update({  sys.argv[2+2*i] : sys.argv[2+2*i+1] })
filename    =  sys.argv[1]
s, w, z, g  =  15, 55, 0.7, 0
if ('-s' in ARG):  s = int(  ARG['-s'])  # *radius* of smoothing kernel
if ('-w' in ARG):  w = int(  ARG['-w'])  # *diameter* of the circular RHT window
if ('-t' in ARG):  z = float(ARG['-t'])  # threshold for R[j,i,theta] values
if ('-g' in ARG):  g = int(  ARG['-g'])  # GPU selection

F          =  pyfits.open(filename)
R, T, J, I, THETAS  =  RollingHoughTransform(F, 2*s+1, w, z, GPU=g)
hdu1       = pyfits.PrimaryHDU(header=F[0].header, data=R)
hdu2       = pyfits.ImageHDU(header=F[0].header, data=T)
col1       = pyfits.Column(name='hi', format='1I', array=I)
col2       = pyfits.Column(name='hj', format='1I', array=J)
col3       = pyfits.Column(name='hthets', format='%dE' % THETAS.shape[1], array=THETAS)
cols       = pyfits.ColDefs([col1, col2, col3])
hdu3       = pyfits.BinTableHDU.from_columns(cols)
G          = pyfits.HDUList([hdu1, hdu2, hdu3])
G.writeto('result.fits', overwrite=True)
print("\nrht_cl %s -s %d -w %d -t %.2f -g %d   (%.2f seconds total)\n" % (filename, s, w, z, g, time.time()-tx))

As expected, the performance is good compared to pure Python. The figure below shows how the total run time scales with the size of the analysed image. The size of the smoothing kernel (diameter 30 pixels) and the size of the “template” (diameter of the circular area where different position angles are tested; here 55 pixels) were kept constant.

The figure shows the total run time of the script (in seconds) as a function of the image size, which ranges from 300×300 pixels up to 9500×9500 pixels. These are shown for both a CPU (with 6 cores) and a GPU. For very small images the constant overheads dominate and the run times are equal. For larger images the GPU becomes faster, but the difference is not very significant. This is mainly because the serial Python code accounts for almost half of the total run time, which in turn is related to the large data volume of the angle histograms, all of which are also saved to a FITS file. We actually ran out of memory before reaching the largest image sizes, because the script allocates buffers to hold the results for the whole image, each pixel having a histogram of more than 100 bins. This could of course be avoided by calling the kernel repeatedly, each time analysing only part of the image.
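A rough NumPy stand-in for that idea (not the OpenCL code itself; `analyse_rows` is a hypothetical placeholder for a single kernel launch) shows that processing the image one stripe of rows at a time gives identical results, while in a real run each stripe's histograms could be post-processed and discarded before the next launch:

```python
import numpy as np

def analyse_rows(bitmap, j_lo, j_hi, RW):
    # Stand-in for one kernel launch: a per-pixel sum over a vertical
    # line segment, evaluated only for the rows j_lo..j_hi-1.
    N, M = bitmap.shape
    out  = np.zeros((j_hi - j_lo, M), np.float32)
    for j in range(max(j_lo, RW), min(j_hi, N - RW)):   # skip the image borders
        for l in range(-RW, RW + 1):
            out[j - j_lo] += bitmap[j + l]
    return out

rng    = np.random.default_rng(0)
bitmap = (rng.random((40, 30)) > 0.5).astype(np.float32)
RW     = 3
# one call for the full image versus four calls, one 10-row stripe at a time
full    = analyse_rows(bitmap, 0, 40, RW)
stripes = np.vstack([analyse_rows(bitmap, j, j + 10, RW) for j in range(0, 40, 10)])
```

The stripes reproduce the full-image result exactly, because each work item only reads pixels and writes results for its own row range.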

We also made a lighter version of the RHT program that returns only R (for each pixel, but not as a function of the position angle) and the average position angle that the kernel has calculated from the full histograms. The histograms are not kept in memory and are not saved to disk. In this case (dashed curves, “no hist.”, in the figure), the Python overhead is significantly reduced and the GPU runs are, at best, more than an order of magnitude faster than the CPU runs. The largest 9500×9500 pixel image is processed on the GPU in under four seconds, and most of that time is now spent on calculations within the kernels. This also suggests that, if possible, any analysis involving the full position angle histograms should be done inside the kernels, instead of always dumping all the data to external files.
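For reference, the per-pixel quantities that the lighter version returns can be reproduced on the host from a single histogram row, using the same formulas as the kernel above (a sketch with a synthetic histogram; the DW and NDIR values are arbitrary):

```python
import numpy as np

DW, NDIR = 21, 45                                    # window diameter, number of directions
theta    = np.arange(NDIR) * np.pi / NDIR
thetas   = np.exp(-0.5 * ((theta - 1.0) / 0.2)**2)   # made-up histogram peaking at theta=1.0

# In the kernel thetas[k] = val/DW, so val = thetas*DW; with that substitution
# R is the straight sum over angles, with the same normalisation as in the kernel,
R = np.sum(thetas * DW) * (np.pi / (NDIR * DW))
# and the angle expectation value follows from the Q and U sums:
Q = np.sum(thetas * DW * np.cos(2.0 * theta))
U = np.sum(thetas * DW * np.sin(2.0 * theta))
angle = 0.5 * np.arctan2(U, Q)                       # [radians]
```

Because the histogram peaks at theta = 1.0 radians, the recovered `angle` lies very close to that value; the factor of two in the trigonometric terms makes the average well defined for an orientation (period pi rather than 2*pi).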

The tests were run on a laptop with a six-core i7-8700K CPU running at 3.70 GHz and an Nvidia GTX-1080 GPU.