## Least squares fits

Observations of dust emission are often analysed by fitting the spectra with a modified blackbody (MBB) function. This consists of the Planck law *B(T)* multiplied by a power law, *f*^{β}, where *f* stands for the observed frequency. By fitting multi-wavelength observations with this function, one can determine the temperature *T* of the dust particles (actually the “colour temperature”, which is related to the temperature distribution within the telescope beam) and the spectral index β, which should depend on the properties of the dust grains.
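
As a sketch, the MBB model can be written in a few lines of Python/numpy. The 250 µm reference frequency is an illustrative choice, and constant factors of the Planck function are omitted, since they cancel when the spectrum is expressed relative to the reference frequency:

```python
import numpy as np

H_K = 4.799243e-11   # h/k in units of K s (Planck over Boltzmann constant)

def planck(f, T):
    # Planck function B(f, T) up to a constant factor (2h/c^2 omitted;
    # it cancels in the ratio below)
    return f**3 / (np.exp(H_K * f / T) - 1.0)

def mbb(f, I0, T, beta, f0=1.2e12):
    # Modified blackbody relative to the reference frequency f0 (~250 um):
    # I(f) = I(f0) * (f/f0)^beta * B(f,T)/B(f0,T)
    return I0 * (f / f0)**beta * planck(f, T) / planck(f0, T)
```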

The MBB spectrum can be fitted to observations by finding the least squares solution (i.e. assuming that the measurements have normally distributed errors). When dealing with images, one takes the data for a single pixel (a single spatial position), one value from each of the maps observed at different wavelengths, fits these with the MBB model, registers the resulting temperature and spectral index values – and moves to the next pixel. The only problem is that images may contain quite a few pixels, which can mean lengthy calculations. However, as long as every pixel position can be fitted separately, the fits can also be computed in parallel.

We wrote an OpenCL program for the MBB fitting of dust observations and compared it to the scipy.optimize.leastsq routine. The OpenCL algorithm is a very simple gradient descent method that in itself should be much less efficient than the least squares routines found in Scipy. However, the OpenCL version could have two advantages. First, the OpenCL kernel is all compiled code, although we will be calling it from within a Python script. Scipy’s leastsq routine is also a compiled library routine, but it calls a function that is defined in Python, and this causes some overhead. Secondly, the OpenCL routine will be run in parallel, especially on a GPU with potentially thousands of concurrent threads.
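
For reference, a single-pixel Python fit with scipy.optimize.leastsq might look like the following minimal sketch. The band set (160–500 µm), the 3% errors, the input parameters, and the starting values are illustrative, and constant factors of the Planck function are omitted since they cancel in the relative fit:

```python
import numpy as np
from scipy.optimize import leastsq

# Frequencies corresponding to 160, 250, 350, and 500 um
F   = 2.9979e8 / (np.array([160.0, 250.0, 350.0, 500.0]) * 1.0e-6)
F0  = 2.9979e8 / 250.0e-6          # reference frequency (250 um)
H_K = 4.799243e-11                 # h/k [K s]

def mbb(p, f):
    # MBB spectrum relative to the reference frequency F0
    I0, T, beta = p
    B  = f**3  / (np.exp(H_K * f  / T) - 1.0)
    B0 = F0**3 / (np.exp(H_K * F0 / T) - 1.0)
    return I0 * (f / F0)**beta * B / B0

def residuals(p, f, y, dy):
    # Error-weighted residuals minimised by leastsq
    return (mbb(p, f) - y) / dy

# One pixel: synthetic observations with 3% random errors
y_true = mbb([10.0, 15.0, 1.8], F)
dy     = 0.03 * y_true
rng    = np.random.default_rng(1)
y      = y_true + rng.normal(0.0, dy)

p, _ = leastsq(residuals, x0=[1.0, 20.0, 2.0], args=(F, y, dy))
print(p)   # recovered (I0, T, beta)
```

In an image, this fit is simply repeated for every pixel, which is the loop that the OpenCL version parallelises.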

The first figure compares the results from the Python and OpenCL routines (using the default tolerances). The upper row shows the intensity, temperature, and spectral index maps that were used to generate synthetic observations at 160, 250, 350, and 500µm wavelengths, to which we also added 3% of random errors. The second row shows the difference between the parameter values recovered by the Python leastsq routine and the input values. The deviations from zero are almost entirely caused by the noise added to the observations. The last row shows the difference between the results from the OpenCL and the Python leastsq routines. With stricter tolerances, the OpenCL solution approaches the leastsq solution, which suggests that the remaining differences come mainly from the finite accuracy of the OpenCL solution. However, the accuracy already seems to be “good enough”: the difference between the OpenCL and Python solutions is two orders of magnitude smaller than the noise-induced uncertainty.

The next figure compares the run times of the Python leastsq solution and the OpenCL routine run on the CPU and on two different GPUs. The analysed images ranged from 3×3 up to 1000×1000 pixels, with the linear size of the image on the *x* axis. Of course, only the total number of pixels matters, and this varies by five orders of magnitude. The *y* axis shows the run times. Up to image sizes of ~40×40 pixels, the Python routine is the fastest, but all routines take less than one second to run. For larger images, the Python version is consistently some ten times slower than the OpenCL routine run on the CPU. In addition to having four cores for parallel execution, the OpenCL routine benefits from the use of compiled code, which seems to outweigh the effect of its less efficient algorithm.

The most interesting result is the comparison between Python and the execution of the OpenCL routine on GPUs. When the image size approaches 1000×1000 pixels, the OpenCL routine is **more than two orders of magnitude faster**, even when run on the integrated GPU (GPU1 in the figure). On the more powerful external GPU, the OpenCL run time is still completely dominated by constant overheads and remains close to one second! This is a good example of the potential of GPGPU computing. However, one should take into account that this is an example where Python performs particularly poorly, because of the numerous calls to user-provided Python functions that are short and cannot make use of vector operations. MBB fitting is also not a particularly difficult optimisation problem, especially since the range of acceptable temperature and spectral index values is rather limited. A general-purpose optimisation routine would be more complicated and probably also slower.

*The tests were run on a laptop with a four-core i7-8550U CPU running at 1.80GHz. The integrated GPU was an Intel UHD Graphics 620 and the external GPU an NVidia GTX 1080 Ti. The code for these tests can be found at https://github.com/mjuvela/ISM, in the MBB subdirectory.*

## Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a general method to estimate probability distributions and can therefore be applied to a very wide range of problems. Here we take as an example the fitting of dust emission spectra.

Multi-wavelength observations in the far-infrared / submillimetre regime are typically fitted with a *modified blackbody spectrum*. The model has three free parameters: intensity, temperature, and spectral index. These can be further related to the dust optical depth or column density, but here we are looking only at the fitting problem *I(f) = I(f_{0}) B(T) (f/f_{0})^{β}*. We have observations of the intensity *I(f)* at a minimum of three frequencies *f*, and we examine the modified blackbody spectrum relative to the intensity *I(f_{0})* at some reference frequency *f_{0}*. Above, *B(T)* stands for the Planck function, which depends on the temperature *T*, and the spectrum is further modified by the (opacity) spectral index *β*. Given the observations *I(f)* and their uncertainties *dI(f)*, our task is to estimate the values of the free parameters *I(f_{0})*, *T*, and *β* – and to estimate their probability distributions.
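
With Gaussian errors, the posterior that MCMC samples is, up to a constant, exp(−χ^{2}/2). A minimal Python sketch of the corresponding log-likelihood (constant factors of the Planck function are omitted, as they cancel in the relative fit):

```python
import numpy as np

H_K = 4.799243e-11   # h/k [K s]

def mbb(f, f0, I0, T, beta):
    # I(f) = I(f0) B(f,T)/B(f0,T) (f/f0)^beta, Planck function up to a
    # constant factor that cancels in the ratio
    B  = f**3  / (np.exp(H_K * f  / T) - 1.0)
    B0 = f0**3 / (np.exp(H_K * f0 / T) - 1.0)
    return I0 * (f / f0)**beta * B / B0

def lnP(p, f, f0, y, dy):
    # Log-likelihood up to an additive constant: -chi^2/2 for Gaussian
    # errors dI(f) (here dy) on the observed intensities y
    I0, T, beta = p
    return -0.5 * np.sum(((mbb(f, f0, I0, T, beta) - y) / dy)**2)
```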

Below I test four implementations of the MCMC method and compare the run times with χ^{2} fits done with Python and the scipy library. Unlike χ^{2} fits, MCMC provides complete probability distributions for the fitted parameters, but it could be correspondingly slower. However, here we implement MCMC using OpenCL, and more specifically the Python interface pyOpenCL, which makes it possible to run the calculations on a GPU as well as on a CPU. The four MCMC versions are

- Basic Metropolis algorithm. The parameter vector is updated with random steps generated from a normal distribution, and only the overall step size is adjusted according to the acceptance rate.
- Propositions for the steps are generated from a three-dimensional normal distribution (corresponding to the three free parameters). The covariance matrix of the multivariate normal distribution is updated based on past steps. This adaptive Metropolis (AM) method was first described in Haario et al. 2001.
- Hamiltonian (or Hybrid) Monte Carlo (HMC) as explained in Neal 2011.
- Robust Adaptive Metropolis (RAM) algorithm described in Vihola 2011.
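
The first of these can be sketched in plain Python/numpy. This is only an illustration of the algorithm, with fixed per-parameter step sizes standing in for the manually selected relative steps; the OpenCL versions follow the same logic but run one chain per spectrum in parallel:

```python
import numpy as np

def metropolis(lnP, p0, step, nsteps, rng):
    # Basic Metropolis: symmetric Gaussian proposals, fixed step sizes
    p  = np.asarray(p0, dtype=float)
    lp = lnP(p)
    chain    = np.empty((nsteps, p.size))
    accepted = 0
    for i in range(nsteps):
        q  = p + step * rng.normal(size=p.size)   # proposed parameter vector
        lq = lnP(q)
        # accept with probability min(1, P(q)/P(p))
        if np.log(rng.random()) < lq - lp:
            p, lp = q, lq
            accepted += 1
        chain[i] = p                               # repeat p if rejected
    return chain, accepted / nsteps
```

For example, with `lnP = lambda p: -0.5 * np.sum(p**2)` the chain samples a standard normal distribution.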

These are not necessarily robust implementations (they may be sensitive to the initial parameter values and do not check for convergence, etc.), but they should give some idea of the relative performance. Running the test script plots an example of the fitting of one spectrum:

The first frame shows the individual samples in the (T, β) plane, where the input values for the spectrum simulation were T=15.0K and β=1.8. The colours black, blue, red, and green correspond to the four MCMC versions described above. The middle frames show pieces of the MCMC chains (no thinning, towards the end of the 10000-step chains). The basic Metropolis has a lower acceptance rate, in spite of the adjusted step size and the fact that the relative step sizes along the three coordinate axes were first selected manually. The same is visible in the last plot of the (normalised) autocorrelation functions.
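
The normalised autocorrelation function shown in the last plot can be computed directly from a chain; a minimal sketch:

```python
import numpy as np

def autocorr(x, maxlag):
    # Normalised autocorrelation function of a one-dimensional MCMC chain:
    # r[0] = 1, and r[k] measures the correlation between samples k steps apart
    x   = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x**2)
    return np.array([np.mean(x[:x.size - k] * x[k:]) / var
                     for k in range(maxlag)])
```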

We next compare the run times to direct χ^{2} minimisation with a Python routine, to see if the knowledge of the posterior probability distributions is worth the extra computation. The run times for 10000 spectra, with 10000 MCMC steps each, are the following:

| Method | Run time |
|---|---|
| χ^{2} minimisation | 95.4 seconds |
| Metropolis | 7.3 seconds |
| AM | 7.3 seconds |
| HMC | 229.0 seconds |
| RAM | 6.9 seconds |

Apart from HMC, the MCMC calculations (with the selected number of steps) are all faster than the Python optimisation. A least squares routine would be less general; it would be faster but still not as fast as the fastest MCMC versions. HMC is slow because of the inner loop in the generation of the step propositions.

Because the script uses OpenCL, we can also run it on a GPU, simply by changing one parameter. Going from the CPU (2.9GHz i7-4910MQ) to a GPU (AMD R9-290M), the run times become

| Method | Run time |
|---|---|
| Metropolis | 1.8 seconds |
| AM | 1.4 seconds |
| HMC | 9.7 seconds |
| RAM | 1.3 seconds |

The results could change with tuning of the parameters. AM and RAM perform very similarly, both in run time and in effective sample size. HMC sees the largest speed-up on the GPU but remains slow in comparison. The problem is also too simple for the benefits of Hamiltonian MCMC to be noticeable. The HMC run time on the CPU may also have been affected by CPU throttling, the tests being run on a laptop. Basic Metropolis is fast, but its samples are badly correlated. In contrast, RAM worked well without any tunable parameters.
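
The effective sample sizes mentioned above can be estimated from a chain's integrated autocorrelation time. A minimal sketch, where truncating the sum at the first negative autocorrelation is one simple convention (an assumption of this sketch, not necessarily the estimator used in the tests):

```python
import numpy as np

def ess(x, maxlag=200):
    # Effective sample size N / tau, with the integrated autocorrelation
    # time tau = 1 + 2*sum(rho_k), summed up to the first negative rho_k
    x   = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x**2)
    tau = 1.0
    for k in range(1, maxlag):
        rho = np.mean(x[:-k] * x[k:]) / var
        if rho < 0.0:
            break
        tau += 2.0 * rho
    return x.size / tau
```

For uncorrelated samples the estimate is close to the chain length, while a strongly correlated chain – like that of the basic Metropolis above – yields a much smaller value.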