A key capability in technical computing is the processing of large,
regularly shaped arrays of numbers by a wide variety of computational
procedures. This facility is foundational in, for example, weather
prediction, artificial intelligence, and image processing.
Correspondingly, modern computing hardware has evolved advanced
capabilities for carrying out such computations with high efficiency.
Unfortunately, adapting a desired computation to a given piece of
hardware has thus far been costly, laborious, and error-prone.
Performance differences of a factor of 50 between a naive realization
and a careful one are the rule rather than the exception. Loopy, the
subject of this project, attacks this problem by using human-guided,
automated program rewriting. Loopy has demonstrated impact in
applications ranging from the simulation of natural and
engineering phenomena to neuroscience, where it has helped its users
achieve higher performance with less effort. The present proposal
concerns several important improvements that will contribute to making
Loopy more effective and easier to apply, through enlarging the class of
programs that Loopy can transform, improving the means by which Loopy
represents on-chip communication, and permitting it to realize important
basic operations that routinely pose difficulty in efficient
implementation. An important component of the effort is making Loopy
itself easy to use for its user community, through the realization of an
interactive user interface, so that program transformations can be
applied with the click of a mouse, rather than by writing computer code.
The proposed advances will be demonstrated through a sample workload
that is emblematic of many of the computational and software challenges
faced in technical computing today.
Multidimensional arrays
(sometimes called 'tensors') are a foundational data structure for much
of scientific computing, with applications ranging from weather
prediction and deep learning to image processing and computational
neuroscience. Even the efficient execution of one of the simplest
operations on arrays, matrix-matrix multiplication, poses considerable
technical challenges on modern computers. Through a polyhedrally based
program transformation tool, the proposed software will provide
separation between mathematical intent and the technical challenges of
program optimization, allowing each task to be performed by a domain
expert. In the proposed project, the PI will develop means for more
efficient on-chip communication, code generation for prefix sums, reuse
and abstraction in program transformation, greater ease of use in
transformation discovery and performance analysis, and the expression of
array computations in user programs. The PI will validate the proposed
techniques through a challenging application with broad applicability.
The intellectual merit of the proposed research lies in (1) mapping out
and extending the landscape of transformation-based programming from
one-off scripts to reusable transform components, (2) the development of
a unifying, loop/array-axis-based approach to expressing on-chip
communication while reducing redundancy in Loopy's program
representation and transformation, (3) exploring the design space of
high-performance languages that establish a close link between execution
placement and data placement, (4) the development of an interactive
program transform and performance analysis tool, along with the
discovery of potential implications for workforce training in
high-performance computing, and (5) a demonstration that all the developed
components can be applied together in a practical and coherent manner.
Through graduate and undergraduate teaching as well as mentoring of the
students and postdocs supported by this project, the PI will contribute to
enlarging the talent pool.
Funding
Elements: Transformation-Based High-Performance Computing in Dynamic Languages
Directorate for Computer & Information Science & Engineering