### Solving chemical networks

The chemical evolution of the interstellar medium can be described by chemical networks that specify which reactions take place and at what rates. For a given physical situation the rates are constant and the time evolution of the chemical abundances is described by a set of ordinary differential equation (ODEs). The number of species is typically hundreds and the number of reactions is thousands.

Methods to solve ODEs are readily available as library routines so the main practical question is how to represent the equations in a computer. One needs to implement two functions, the first returning the derivatives and the second returning the Jacobian for the parameter vector. The parameter vector consists of the abundances of the examined species for the current time step. An explicit Jacobian function is not always necessary but should speed up the computation of the ODEs.

In the program pyRate (Sipilä 2010) the solution for the time evolution of chemical species was written in Python, using standard ODE solvers found in the scipy library. As an interpreted language, Python is not very fast in numerical calculations. Therefore, in the first pyRate version these functions were written in Pyrex/Cython. The Python main program wrote a separate Cython programme that was compiled on the fly and given to the ODE solver routine. In later programme versions Cython was replaced with an Python extension. However, the idea remained the same. The ODE solver uses compiled routines and there is minimal overhead from the main program being written in Python.

The rate coefficients for the given physical setting were first written directly into the subroutines. Once the routine is compiled, this ensured fast ODE solution. However, usually one has a range of physical conditions for which the solution is needed, for example in the case of models with a number of cells with different densities and temperatures. If the subroutine needs to be recompiled for every cell, this is a significant overhead.

The next logical step is to write the subroutines in a more generic way and to pass the rate coefficients as a parameter vector. This way the subroutines need to be compiled only once, while the execution of the subroutines calls may now take a bit longer.

We compare below some of the run times for a test case that included 419 species and 4211 reactions and for which the time evolution was computed for a period of 10^{6} years. These results are for different cases where the gradient vector was implemented as a compiled Python extension and the Jacobian was not used.

- the first version (rate coefficients embedded in the subroutine) took 0.34 seconds and the compilation of the subroutine an additional 1.36 seconds
- adding optimisation to the compilation (-O2), the ODE solution became faster (0.23 seconds) but the compilation took now 7.9 seconds!
- the next version (rate coefficients as a subroutine parameter) had a run time of 1.17, which is faster than above, considering that this version does not require recompilation
- with optimisations (-O2) the run time dropped to 0.46 seconds
- OpenMP parallelisation reduced the run time to 0.35 seconds
- because the parallelisation is over species (each associated to a different number of reactions), the use of OpenMP dynamic scheduling improved the run time further down to 0.19 seconds

Thus, the final version could thus process larger models at a rate of four cells per second. We note that the speedup provided by OpenMP is still far from perfect. It is only a factor of ~2.5 while the CPU had six cores available!

We also tested a version where the gradient vector calculation was implemented as an OpenCL kernel. The results were not particularly encouraging. The run times being over 5 seconds and very similar on both CPU and GPU. OpenCL has significant overhead from the fact that each function call includes the copy and transfer of the whole abundance and abundance gradient vectors. Furthermore, unlike with OpenMP, there is no dynamic scheduling. This will lead to some threads being idle while they are waiting others that are processing species with more reactions.

OpenCL might still be competitive for much larger problems. However, the ratio of computations vs. data transfer does not increase very rapidly as new species are added. The other option would be to transfer all computations to the device, thus avoiding the data transfer overheads altogether. We have experimented with ODE solvers on GPUs. In those cases the parallelisation can also be done across model cells (rather than across chemical species) because that should provide good work balance between the threads. In preliminary tests, OpenCL on CPU can be competitive with the above versions but the performance on GPUs has been disappointing. Implicit ODE solvers require the solving of linear equations of moderate size. This has turned out (so far) very inefficient on GPUs, when each thread is working on a different set of equations.