**RHT** (the Rolling Hough Transform) is another image analysis method used to locate filamentary structures. In some respects it is a special case of general template matching, where the template is (in its simplest version) a linear line segment. As in our template matching code, the analysis starts by subtracting from the original image a filtered image (usually calculated with a top-hat filter), and the result is again compared to a “template”, a description of a linear structure. However, there are some fundamental differences:

(1) After subtraction of the filtered data, the image is thresholded at zero: pixels with negative values are replaced with zeros and pixels with positive values with ones. In template matching, no such quantisation of the input data was done.

(2) At each pixel position, and for a number of position angles, one effectively calculates the match between a “template” and the data. However, because the data now consist only of ones and zeros, the match (the number of positive pixels falling on a line) does not depend on the original pixel intensities.
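Steps (1) and (2) can be sketched in plain NumPy (a toy example only, not the OpenCL implementation shown below; the arrays and the pixel and angle values are hypothetical):

```python
import numpy as np

# Hypothetical input: the observed map and its top-hat-filtered version.
rng      = np.random.default_rng(0)
image    = rng.normal(size=(64, 64)).astype(np.float32)  # stand-in for the data
filtered = np.zeros_like(image)                          # stand-in for the filtered map

# Step (1): subtract and threshold at zero -- the result is ones and zeros only.
binary = (image - filtered > 0.0).astype(np.float32)

# Step (2): the "match" along a line through pixel (j0, i0) at angle theta is
# just a count of ON pixels; the original intensities no longer matter.
j0, i0, theta, RW = 32, 32, 0.0, 5
l = np.arange(-RW, RW + 1)                       # steps of 1 pixel along the line
i = np.floor(i0 + 0.5 - l * np.sin(theta)).astype(int)
j = np.floor(j0 + 0.5 + l * np.cos(theta)).astype(int)
match = int(binary[j, i].sum())                  # number of ON pixels on the line
```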

A description of the RHT applied to astronomical image analysis can be found in Clark, Peek, & Putman (2014), and a Python implementation of the routine is available on GitHub (seclark/RHT).

The analysis of large images can take some time, so we were interested in testing how easy it would be to make a parallel version using OpenCL. First, because of the limited extent of the filter function, the convolution can be done efficiently in real space (instead of using Fourier methods). Second, the actual RHT calculation consists of similar operations applied at the location of each map pixel, for a large number of potential position angles. This makes the problem (just like general template matching) very suitable for parallelisation and for GPU computing.
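As a point of reference, the per-pixel filtering step could be written in pure Python roughly as below (a slow illustration only; the function name and the looping are our own sketch, the real work is done in the OpenCL kernel further down). Each output pixel depends only on a DK×DK neighbourhood, which is exactly why the same computation maps naturally onto one OpenCL work item per pixel:

```python
import numpy as np

def tophat_smooth(S, DK):
    """Real-space comparison against a circular top-hat mean of diameter DK pixels.
    Zero-valued pixels are treated as missing data, as in the kernel code."""
    N, M = S.shape
    RK   = (DK - 1) // 2                      # DK should be an odd number
    out  = np.zeros_like(S)
    for j0 in range(N):                       # rows
        for i0 in range(M):                   # columns
            jj, ii = np.mgrid[max(0, j0-RK):min(j0+RK+1, N),
                              max(0, i0-RK):min(i0+RK+1, M)]
            inside = (ii-i0)**2 + (jj-j0)**2 <= RK*RK   # circular footprint
            vals   = S[jj, ii][inside]
            vals   = vals[vals != 0.0]        # 0.0 marks a missing value
            mean   = vals.mean() if vals.size else 0.0
            # subtract the local mean and threshold at zero
            out[j0, i0] = 1.0 if (S[j0, i0] - mean) > 0.0 else 0.0
    return out
```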

Below is a minimal OpenCL version that reads in a FITS image, does the RHT analysis, and saves the outcome to a FITS file. The results consist of the RHT intensity *R* for each pixel (integrated over angle) and an estimate of the position angle for each pixel, both stored as 2D images. The third header-data unit of the FITS file contains a table of coordinates for the pixels where *R* was above a predetermined threshold. For each of these pixels, the table also includes a histogram of *R* as a function of the position angle. Using a couple of standard Python libraries and the pyopencl library, the whole code takes only 191 lines! The program includes two kernels, one for the filtering and one for the actual RHT calculations. In this example the kernel code is included as a string inside the main Python script.

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2018, Mika Juvela, (subject to the FreeBSD License)
import sys, time
import numpy as np
from astropy.io import fits as pyfits
import pyopencl as cl

kernel_source = \
"""
__kernel void Smooth(__global float *S,    // input image [N,M]
                     __global float *SS) { // filtered image
   // Pre-filtering of the input image before RHT calculation, top-hat function.
   int id = get_global_id(0) ;       // one output pixel per work item
   if (id>=(N*M)) return ;           // array S[j,i] has dimensions [N,M]
   int i0 = id % M ;                 // column   0 <= i < M
   int j0 = id / M ;                 // row      0 <= j < N
   int RK = (DK-1)/2 ;               // DK should be an odd number !
   int R2 = RK*RK ;
   float sum=0.0f, wsum=0.0f, r2 ;
   for(int i=max(0, i0-RK); i<min(i0+RK+1, M); i++) {     // columns
      for(int j=max(0, j0-RK); j<min(j0+RK+1, N); j++) {  // rows
         r2 = (i-i0)*(i-i0)+(j-j0)*(j-j0) ;
         if (r2<=R2) {
            if (S[i+M*j]!=0.0f) {    // 0.0 denotes here a missing value!
               wsum += 1.0f ;   sum += S[i+M*j] ;
            }
         }
      }
   }
   if (wsum>1.0e-10f) {
      if ((S[id]-sum/wsum)>0.0f)  SS[id] = 1.0f ;   // ... thresholded at zero
      else                        SS[id] = 0.0f ;
   } else {
      SS[id] = 0.0f ;
   }
}

__kernel void R_kernel_oneline(__global float *SS,       // thresholded image SS[N,M]
                               __global float *RBP,      // backprojection R[N,M]
                               __global float *THETA,    // average angle THETA[N,M]
                               __global float *THETAS) { // theta histograms [j,i,itheta]
   // Basic version of RHT -- one steps along a single line, steps of 1.0 pixels
   int id = get_global_id(0) ;
   if (id>=(N*M)) return ;
   int i0 = id % M ;      // column index in the image -- 0 <= i < M
   int j0 = id / M ;      // row index
   int RW = (DW-1)/2 ;    // RW=radius of the circular window; DW should be an odd number
   if ((i0<(1+RW))||(i0>=(M-1-RW))||(j0<(1+RW))||(j0>(N-1-RW))) { // ignore image borders
      RBP[id] = 0 ;   THETA[id] = 0.0f ;   return ;
   }
   int   val, sum=0 ;
   float Q=0.0f, U=0.0f, si, co, theta ;
   __global float *thetas = &THETAS[id*NDIR] ; // histogram R[j,i,theta] for current pixel
   for(int k=0; k<NDIR; k++) {                 // for each direction
      theta = k*M_PI_F/NDIR ;
      si    = sin(theta) ;   co = cos(theta) ;
      val   = 0 ;
      for(int l=-RW; l<=+RW; l++) {            // loop along the line, DW=2*RW+1 points
         int i = (int)floor(i0+0.5f-l*si) ;    // reference is the centre of the pixel
         int j = (int)floor(j0+0.5f+l*co) ;
         if (SS[i+M*j]>0.0f) val++ ;           // val == ON pixels along the line
      }
      val  = (val<THRESHOLD) ? 0 : (val-THRESHOLD) ; // thresholded number of ON pixels
      sum += val ;
      co = cos(2.0f*theta) ;                   // re-use si and co variables
      si = sin(2.0f*theta) ;
      Q += val*co ;             // calculating integral of R*cos(2*theta) over angle
      U += val*si ;             // ... and of R*sin(2*theta)
      thetas[k] = val/(float)DW ;              // single bin in the histogram
   }
   RBP[id]   = sum * (M_PI_F/(NDIR*DW)) ; // straight sum over angles (~backprojection)
   THETA[id] = 0.5f*atan2(U, Q) ;         // [radians], RHT angle expectation value
}
"""

def InitCL(GPU=False):
    """
    Initialise OpenCL.
    Usage:
        platform, device, context, queue, mf = InitCL(GPU)
    Input:
        GPU = if True, try to find a GPU instead of CPU
    Return:
        platform, device, context, queue, mf(=flags)
    """
    platform, device, context, queue = None, None, None, None
    for iplatform in range(3):  # try up to three different OpenCL platforms (if they exist)
        try:
            platform = cl.get_platforms()[iplatform]
            device   = platform.get_devices([cl.device_type.CPU, cl.device_type.GPU][GPU>0])
            context  = cl.Context(device)
            queue    = cl.CommandQueue(context)
            break
        except:
            pass
    mf = cl.mem_flags
    return platform, device, context, queue, mf

def RollingHoughTransform(F, DK, DW, FRAC, GPU=0):
    """
    Calculate Rolling Hough Transform for image F.
    Usage:
        R, THETA, J, I, THETAS = RollingHoughTransform(F, DK, DW, FRAC, GPU=0)
    Input:
        F    = pyfits image to be analysed
        DK   = size of convolution kernel = diameter [pixels]
        DW   = diameter of the circular region [pixels]
        FRAC = fraction of ON pixels used as the threshold
        GPU  = if >0, use GPU instead of CPU
    Output:
        R[j,i], theta, J, I, R[j,i,itheta]
    """
    # Initialise OpenCL environment and compile the kernel program.
    t0 = time.time()
    platform, device, context, queue, mf = InitCL(GPU)
    LOCAL     = [8, 64][GPU>0]      # GPU uses more local work items
    N, M      = F[0].data.shape     # image dimensions
    THRESHOLD = int(FRAC*DW)        # "Z*DW"
    NDIR      = int(np.floor((3.14159/np.sqrt(2.0))*(DW-1.0)+1.0))  # number of directions
    OPT       = " -D DK=%d -D DW=%d -D N=%d -D M=%d" % (DK, DW, N, M)
    OPT      += " -D NDIR=%d -D THRESHOLD=%d -D FRAC=%.4ef" % (NDIR, THRESHOLD, FRAC)
    program   = cl.Program(context, kernel_source).build(OPT)
    print('--- initialisation %5.2f seconds' % (time.time()-t0))
    # Define buffers for data transfer between the host and the device
    S          = np.asarray(np.ravel(F[0].data), np.float32)   # original image
    SS         = np.empty_like(S)                              # will be the filtered image
    S_buf      = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=S)
    SS_buf     = cl.Buffer(context, mf.READ_WRITE, S.nbytes)
    RBP_buf    = cl.Buffer(context, mf.WRITE_ONLY, S.nbytes)   # R[j,i] ("backprojection")
    THETA_buf  = cl.Buffer(context, mf.WRITE_ONLY, S.nbytes)   # average theta angle
    THETAS_buf = cl.Buffer(context, mf.WRITE_ONLY, 4*NDIR*N*M) # histograms R[j,i,itheta]
    # One kernel for the filtering of the map
    SMOO = program.Smooth
    SMOO.set_scalar_arg_dtypes([None, None])
    t0 = time.time()
    GLOBAL = (int((N*M)/64)+1)*64                    # total number of work items
    SMOO(queue, [GLOBAL,], [LOCAL,], S_buf, SS_buf)  # SS_buf will contain the filtered image
    cl.enqueue_copy(queue, SS, SS_buf)               # filtered image copied to SS
    print('--- smoothing      %5.2f seconds' % (time.time()-t0))
    # Another kernel for the RHT calculation
    t0  = time.time()
    RHT = program.R_kernel_oneline
    RHT.set_scalar_arg_dtypes([None, None, None, None])
    RHT(queue, [GLOBAL,], [LOCAL,], SS_buf, RBP_buf, THETA_buf, THETAS_buf)
    RBP    = np.zeros((N, M), np.float32)
    THETA  = np.zeros((N, M), np.float32)
    THETAS = np.zeros((N*M, NDIR), np.float32)
    cl.enqueue_copy(queue, RBP,    RBP_buf)          # copy results from device to host
    cl.enqueue_copy(queue, THETA,  THETA_buf)
    cl.enqueue_copy(queue, THETAS, THETAS_buf)
    print('--- RHT            %5.2f seconds' % (time.time()-t0))
    m    = np.nonzero(RBP>0.0)                       # selected pixels
    J, I = m[0], m[1]
    THETAS = THETAS[np.nonzero(np.ravel(RBP)>0.0)[0], :]
    return RBP, THETA, J, I, THETAS


if (len(sys.argv)<2):
    print("\nUsage: rht_cl.py filename [options]")
    print("  options:")
    print("   -w  #   window diameter in pixels (default 55)")
    print("   -s  #   smoothing radius in pixels (default 15)")
    print("   -t  #   threshold (default 0.7)")
    print("   -g  #   0 or 1, whether to use GPU (default 0)")
    print("  output:")
    print("   FITS file result.fits")
    print("    F[1] = image containing R backprojection")
    print("    F[2] = image containing angles")
    print("    F[3] = { i, j, { R[j,i,theta] } for pixels with R above threshold\n")
    sys.exit()

tx, ARG = time.time(), dict()
for i in range(len(sys.argv)//2-1):        # integer division (Python 3)
    ARG.update({ sys.argv[2+2*i] : sys.argv[2+2*i+1] })
filename   = sys.argv[1]
s, w, z, g = 15, 55, 0.7, 0
if ('-s' in ARG): s = int(  ARG['-s'])     # *radius* of smoothing kernel
if ('-w' in ARG): w = int(  ARG['-w'])     # *diameter* of the circular RHT window
if ('-t' in ARG): z = float(ARG['-t'])     # threshold for R[j,i,theta] values
if ('-g' in ARG): g = int(  ARG['-g'])     # GPU selection
F = pyfits.open(filename)
R, T, J, I, THETAS = RollingHoughTransform(F, 2*s+1, w, z, GPU=g)
hdu1 = pyfits.PrimaryHDU(header=F[0].header, data=R)
hdu2 = pyfits.ImageHDU(header=F[0].header, data=T)
col1 = pyfits.Column(name='hi',     format='1I', array=I)
col2 = pyfits.Column(name='hj',     format='1I', array=J)
col3 = pyfits.Column(name='hthets', format='%dE' % THETAS.shape[1], array=THETAS)
cols = pyfits.ColDefs([col1, col2, col3])
hdu3 = pyfits.BinTableHDU.from_columns(cols)
G = pyfits.HDUList([hdu1, hdu2, hdu3])
G.writeto('result.fits', overwrite=True)
print("\nrht_cl %s -s %d -w %d -t %.2f -g %d  (%.2f seconds total)\n" %
      (filename, s, w, z, g, time.time()-tx))
```

As expected, the performance is good compared to pure Python. The figure below shows how the total run time scales with the size of the analysed image. The size of the smoothing kernel (diameter 30 pixels) and the size of the “template” (diameter of the circular area where different position angles are tested; here 55 pixels) were kept constant.

The figure shows the total run time of the script (in seconds) as a function of the image size, which ranges from 300×300 pixels up to 9500×9500 pixels. Run times are shown for both a CPU (with 6 cores) and a GPU. For very small images the constant overheads dominate and the run times are equal. For larger images the GPU becomes faster, but the difference is not very significant. This is mainly because the serial Python code accounts for almost half of the total run time, which in turn is related to the large amount of data contained in the angle histograms, all of which are also saved to a FITS file. We actually ran out of memory before reaching the largest image sizes, because the script allocates buffers to store results for the whole image, with each pixel having a histogram of more than 100 bins. This could of course be avoided by calling the kernel repeatedly, each time analysing only part of the image.
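Such a tiled scheme could be sketched as follows (a hypothetical outline, not part of the script above; `rht_on_tile` is a stand-in for a kernel call on one sub-image, and the overlap must be at least the RHT window radius so that the border pixels of each tile are computed correctly):

```python
import numpy as np

def rht_on_tile(tile):
    # Placeholder for the real OpenCL kernel call on one sub-image;
    # here it just returns zeros of the right shape.
    return np.zeros(tile.shape, np.float32)

def rht_tiled(image, tile=2048, RW=27):
    """Process a large image in overlapping tiles so that per-pixel angle
    histograms never need to be allocated for the full map at once."""
    N, M = image.shape
    R = np.zeros((N, M), np.float32)
    for j in range(0, N, tile):
        for i in range(0, M, tile):
            j0, j1 = max(0, j-RW), min(N, j+tile+RW)   # tile plus overlap margins
            i0, i1 = max(0, i-RW), min(M, i+tile+RW)
            sub = rht_on_tile(image[j0:j1, i0:i1])
            # keep only the interior of the tile, discarding the overlap
            nj, ni = min(N, j+tile)-j, min(M, i+tile)-i
            R[j:j+nj, i:i+ni] = sub[j-j0:j-j0+nj, i-i0:i-i0+ni]
    return R
```

The per-tile histograms would then be written out (or reduced) before the next tile is processed, keeping the peak memory use proportional to the tile size instead of the image size.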

We also made a lighter version of the RHT program that returns only *R* (for each pixel, but not as a function of the position angle) and the average position angle that the kernel has calculated from the full histograms. The histograms are not kept in memory and are not saved to disk. In this case (dashed curves, “no hist.”, in the figure) the Python overhead is significantly reduced, and the GPU runs are now up to more than an order of magnitude faster than the CPU runs. The largest 9500×9500 pixel image is processed on the GPU in under four seconds, and most of that time is now spent on calculations within the kernels. This also suggests that, if possible, any analysis involving the full position angle histograms should be done inside the kernels, instead of always dumping all the data to external files.
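The in-kernel angle estimate is based on weighting cos 2θ and sin 2θ with the *R*(θ) values, as in the `Q`/`U` sums of the kernel above; a minimal NumPy illustration with a made-up histogram:

```python
import numpy as np

# Mock histogram R(theta) peaking at theta = 1.0 rad. Doubling the angle
# (cos 2*theta, sin 2*theta) makes the weighted average insensitive to the
# 180-degree ambiguity of a line orientation.
NDIR  = 32
theta = np.arange(NDIR) * np.pi / NDIR
R     = np.exp(-0.5 * ((theta - 1.0) / 0.2)**2)

Q = np.sum(R * np.cos(2.0 * theta))
U = np.sum(R * np.sin(2.0 * theta))
mean_theta = 0.5 * np.arctan2(U, Q)   # close to the peak position of 1.0 rad
```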

*The tests were run on a laptop with a six-core i7-8700K CPU running at 3.70 GHz and with an Nvidia GTX-1080 GPU.*