Automatically optimizing tree traversal algorithms
Many domains in computer science, from data-mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures like trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and parallelism. Programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce. ^ In this dissertation, we argue that, for an important subset of irregular programs which arises in many domains, namely, tree traversal algorithms like Barnes-Hut, nearest neighbor and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement: locality enhancement and vectorization. Furthermore we demonstrate that these techniques can be automatically applied by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort. ^ Achieving high performance in many applications requires achieving good locality of reference. We propose two novel transformations called point blocking and traversal splicing, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in tree traversals. We then present a transformation framework called TREE SPLICER, that automatically applies these transformations, and uses autotuning techniques to determine appropriate parameters for the transformations. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing can deliver single-thread speedups of up to 8.71 (geometric mean: 2.48), just from better locality. ^ Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately tree algorithms often feature highly diverging traversals which inhibit efficient SIMD utilization, to the point that other, less profitable sources of vectorization must be exploited instead. We propose a dynamic reordering of traversals based on previous behavior, based on the insight that traversals which have behaved similarly so far are likely to behave similarly in the future, and show that this reordering can dramatically improve the SIMD utilization of diverging traversals, close to ideal utilization. We present a transformation framework, SIMTREE, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59 (geometric mean: 2.78). Furthermore our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not.^
Milind Kulkarni, Purdue University.