##### NICER

NICER is a method used to estimate extinction based on the reddening of the light of background stars, that is, how much redder stars appear to be compared to their intrinsic colours. The reddening is caused by dust clouds between the stars and the observer and thus gives a measure for the amount of interstellar matter along that line of sight. NICER is a way to optimally combine measurements of several clouds. The method is described in an article by Lombardi & Alves (2001).

The calculation of a map of extinction consists of two step. First, one estimates the extinction for every star in the field. Second, one calculates extinction values for each pixel in the map. Unlike stars, pixels form a regular grid on the sky and each pixel value is obtained as a weighted average over the extinction values of stars near that pixel. Both steps should be easy to parallelise: extinction estimates of individual stars can be calculated in parallel and similarly the averages for each pixel can be calculated in parallel.

We wrote a Python program that used two OpenCL kernels that correspond to the two steps mentioned above. The task of the main program is only to read in the input data, call the kernels, and write the results to files. Here is the listing of the first kernel, where the work items together loop over all the stars. The kernel is listed (with minimal comments) just to show what kind of calculations it involves. The kernel calls only one external routine, solve_cramer2, which calculates the inverse of a matrix (not shown).

__kernel void nicer( __global float *K, // [NB], extinction relative to Av __global float *RCOL, // average colours in reference area __global float *RCOV, // covariance of colour in ref. area __global float *MAG, // magnitudes [NS*NB] __global float *dMAG, // magnitude error estimates __global float *AV, // output: estimate Av [NS] __global float *dAV // output: estimated dAv [NS] ) { int i, j, ss = get_global_id(0) ; // one pixel int gs = get_global_size(0) ; if (ss>=NS) return ; // id > NS, the number of ON field stars float C[NB*NB] ; // NB = number of bands float av, b[10] ; for(int s=ss; s<NS; s+=gs) { // work items loop together over all NS stars for(i=0; i<NC; i++) { // NC = number of colours C[NC*NB+i] = -K[i] ; C[i*NB+NC] = -K[i] ; } for(i=0; i<NB; i++) b[i] = 0.0f ; b[NC] = -1.0f ; for(i=0; i<NC; i++) { for(j=0; j<NC; j++) { C[i*NB+j] = 0.0f ; } } for(i=0; i<NC; i++) { C[i*NB+i] = dMAG[s*NB+i]*dMAG[s*NB+i] + dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ; } C[NC+NB*NC] = 0.0f ; for(i=0; i<NC-1; i++) { C[i*NB + i+1] = -dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ; C[(i+1)*NB+i] = -dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ; } for(i=0; i<NC; i++) { for(j=0;j<NC; j++) { C[i*NB+j] += RCOV[i*NC+j*j] ; } } float C0=C[0], C1=C[1], C4=C[4] ; float det ; solve_cramer3(C, b, &det) ; av = 0.0 ; for(i=0; i<NC; i++) av += ((MAG[s*NB+i]-MAG[s*NB+(i+1)])-RCOL[i]) * b[i] ; AV[s] = av ; dAV[s] = sqrt(b[0]*b[0]*C0 + b[1]*b[1]*C4 + 2.0f*b[0]*b[1]*C1) ; } }

The second kernel calculates for each pixel a weighted average of the Av values of individual stars. This time work items (threads) loop over the map pixels. The calculation is done actually two times, the second loop dropping outliers based on user-defined thresholds (CLIP_UP and CLIP_DOWN).

__kernel void smooth( __global float *RA, // coordinates of the stars [NS] __global float *DE, // -"- __global float *A, // Av of individual stars __global float *dA, // dAv of individual stars __global float *SRA, // coordinates of the pixels [NPIX] __global float *SDE, // -"- __global float *SA, // Average Av for a pixel __global float *dSA // error estimate ) { int j, idd = get_global_id(0) ; // index of smoothed value, single pixel int gs = get_global_size(0) ; if (idd>=NPIX) return ; // calculate weighted average with sigma-clipping float ra, de ; // centre of the beam float cosy = cos(de) ; // we can use the plane approximation for distances float w, dx, dy, weight, sum, s2, uw, d2, ave, std, count ; float K = 4.0f*log(2.0f)/(FWHM*FWHM) ; // radian^-2 const float LIMIT2 = 9.0f*FWHM*FWHM ; // ignore stars further than sqrt(LIMIT2) for(int id=idd; id<NPIX; id+=gs) { ra = SRA[id] ; de = SDE[id] ; weight = sum = s2 = uw = count = 0.0f ; for(j=0; j<NS; j++) { dx = cosy* (RA[j]-ra) ; dy = (DE[j]-de) ; d2 = dx*dx + dy*dy ; if (d2<LIMIT2) { w = exp(-K*d2) / (dA[j]*dA[j]) ; weight += w ; sum += w*A[j] ; // weighted sum uw += A[j] ; // unweighted sum s2 += A[j]*A[j] ; // for unweighted standard deviation count += 1.0f ; } } ave = sum/weight ; // weighted average std = sqrt(s2/count - (uw/count)*(uw/count)) ; // unweighted std // Repeat, this time with sigma clipping weight = sum = s2 = uw = count = 0.0f ; for(j=0; j<NS; j++) { dx = cosy* (RA[j]-ra) ; dy = DE[j]-de ; d2 = dx*dx+dy*dy ; if ((d2<LIMIT2) && (A[j]>(ave-CLIP_DOWN*std)) && (A[j]<(ave+CLIP_UP*std)) ) { w = exp(-K*d2) / (dA[j]*dA[j]) ; weight += w ; sum += w*A[j] ; uw += A[j] ; count += 1.0f ; s2 += w*w*dA[j]*dA[j] ; // weighted } } if (count>1.0) { SA[id] = sum/weight ; // weighted average dSA[id] = sqrt(s2)/weight ; // weighted } else { // count <= 1 if (weight>0.0) { SA[id] = sum/weight ; dSA[id] = 999.0 ; } else { SA[id] = 0.0 ; dSA[id] = 999.0 ; } } } }

Interestingly the two kernels scale differently on CPU and GPU. We computed a series of test cases, starting with a map that was 0.5×0.5 degrees in size and had a pixel size of 1.0 arcmin. The number of stars over the area was 75 000. We scaled the problem by increasing either the number of stars (which affects both kernels) or, alternatively, only the number of pixels (affecting the second kernel). The resulting run times are shown in the plot below.

On the x-axis value of 1 corresponds to the original problem size as described above. Here CPU refers to a run with a CPU with 6 hyperthreaded cores. When the number of stars is increased, the scaling is similar for both CPU and GPU, the latter being faster by a factor of 3-4. On the other hand, when the number of pixel is increased and a larger fraction of time is spend on the second kernel, the GPU run time barely increases at all. However, here the number of pixels is increased only up to ~60 000 pixels and by that time GPU is again almost a factor of 4 faster than the 6-core CPU.

*The tests were run on a laptop with a six core i7-8700K CPU running at 3.70 GHz and with an Nvidia GTX-1080 GPU*.