Parallelism

The algorithms implemented in Nancy often consist of a single operation to be performed, independently, on a large set of elements. For example, the (min,+) convolution consists of a) a large set of element-by-element convolutions, b) grouping the results of these convolutions by intervals of overlap, and c) computing the per-interval lower envelopes. Since these operations are independent (Nancy is implemented with immutable objects), they can be easily parallelized.

The C# language and .NET Runtime make this easy to do with few code changes, mainly through the AsParallel() method, which makes a LINQ query run on all available cores rather than a single one.
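As a minimal sketch of the idea (illustrative only, not Nancy's actual code), an independent element-wise computation expressed as a LINQ query can be switched to parallel execution by adding a single AsParallel() call:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var inputs = Enumerable.Range(1, 1000).ToArray();

        // Sequential LINQ: runs on a single core.
        var sequential = inputs.Select(x => (long)x * x).Sum();

        // Parallel LINQ: AsParallel() partitions the work across all available cores.
        // Because the per-element operation is independent and side-effect free,
        // the result is the same as in the sequential case.
        var parallel = inputs.AsParallel().Select(x => (long)x * x).Sum();

        Console.WriteLine(sequential == parallel); // prints True
    }
}
```

The key requirement is the one stated above: each per-element operation must be independent and free of shared mutable state, which immutable objects guarantee by construction.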

Note on effective speed-ups

In the research community, one is often asked about the effectiveness of an algorithm's parallelization, in particular whether the speedup is close to the number of cores. We wish to dispel this notion by arguing that it is an impractical point of view.

In fact, asking whether an algorithm could gain significant performance by increasing its degree of parallelism would be meaningful if we were choosing how many cores to have. In most practical use cases, however, we are not making that choice.

In most systems nowadays, since we are used to running advanced operating systems with heavy multiprocessing and interactivity, all but the lowest-price options feature 8 cores, with many common options featuring 12 or 16. Furthermore, the number of cores usually grows in tandem with the processing power of a single core, so that the number of cores is rarely a direct factor in the price-to-performance evaluation.

Thus, since we already have multicore systems and their parallel processing power available, the question should be: is this software capable of exploiting the resources available on the system to complete faster, rather than sitting on 15/16 of its processing power? In the case of Nancy, the answer is yes.

That is not to say that the approach is perfect: heuristics are employed to decide whether an operation would benefit from going parallel, and such heuristics may fail or require adjustments on a different architecture. However, even if only a 10x speedup were obtained on a 16-core machine, that is still six minutes instead of an hour.
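As an illustrative sketch of what such a heuristic can look like (the names and the threshold value here are hypothetical, not Nancy's actual implementation), the simplest form compares the number of elements against a tuned cutoff below which the overhead of partitioning and scheduling outweighs the parallel gain:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ParallelHeuristic
{
    // Hypothetical cutoff: for smaller inputs, PLINQ's partitioning and
    // scheduling overhead tends to outweigh the benefit of parallelism.
    // The right value depends on the per-element cost and the architecture,
    // which is why such heuristics may need adjustment on different machines.
    const int ParallelizationThreshold = 1_000;

    public static List<TResult> MapMaybeParallel<T, TResult>(
        IReadOnlyList<T> items, Func<T, TResult> map)
    {
        return items.Count >= ParallelizationThreshold
            ? items.AsParallel().AsOrdered().Select(map).ToList()
            : items.Select(map).ToList();
    }
}
```

AsOrdered() is used so that the parallel path returns results in the same order as the sequential one, keeping the two branches interchangeable from the caller's point of view.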