A key capability in technical computing is the
processing of large, regularly-shaped arrays of numbers by a wide
variety of different processes. This facility is foundational in, for
example, weather prediction, artificial intelligence, and image
processing. Correspondingly, modern computing hardware has evolved
advanced capabilities for carrying out such computations with high
efficiency. Unfortunately, the process of adapting a desired process to a
given piece of hardware thus far is costly, laborious, and error-prone.
Differences of a factor of 50 in performance between a naive
realization and a careful one is the rule, rather than the exception.
Loopy, the subject of this project, attacks this problem by using
human-guided, automated program rewriting. Loopy has demonstrated
application impact in a number of applications ranging from the
simulation of natural and engineering phenomena to neuroscience, where
it has helped its users achieve higher performance with less effort. The
present proposal concerns several important improvements that will
contribute to making Loopy more effective and easier to apply, through
enlarging the class of programs that Loopy can transform, improving the
means by which Loopy represents on-chip communication, and permitting it
to realize important basic operations that routinely pose difficulty in
efficient implementation. An important component of the effort is
making Loopy itself easy to use for its user community, through the
realization of an interactive user interface, so that program
transformations can be applied with the click of a mouse, rather than by
writing computer code. The proposed advances will be demonstrated
through a sample workload that is emblematic of many of the
computational and software challenges faced in technical computing
today.
Multidimensional arrays (sometimes called 'tensors') are a
foundational data structure for much of scientific computing, with
applications ranging from weather prediction to deep learning, to image
processing and computational neuroscience. Even the efficient execution
of one of the simplest operations on arrays, matrix-matrix
multiplication, poses considerable technical challenges on modern
computers. Through a polyhedrally-based program transformation tool, the
proposed software will provide separation between mathematical intent
and the technical challenges of program optimization, allowing each task
to be performed by a domain expert. In the proposed project, the PI
will develop means for more efficient on-chip communication, code
generation for prefix sums, reuse and abstraction in program
transformation, increasing the ease of use in transformation discovery
and performance analysis, and for expressing array computations in user
programs. The PI will validate the proposed techniques through a
challenging application with broad applicability. The intellectual merit
of the proposed research lies in (1) mapping out and extending the
landscape of transformation-based programming from one-off scripts to
reusable transform components, (2) the development of a unifying,
loop/array-axis-based approach to expressing on-chip communication while
reducing redundancy in Loopy?s program representation and
transformation, (3) exploring the design space of high-performance
languages that establish a close link between execution placement and
data placement, (4) the development of an interactive program transform
and performance analysis tool, along with the discovery of potential
implications for workforce training in high-performance computing, (5) a
demonstration that all the developed components can be applied together
in a practical and coherent manner. Through graduate and undergraduate
teaching as well as mentoring of the students and postdocs supported by
this project, the PI contributes to enlarging the talent pool.
Funding
Elements: Transformation-Based High-Performance Computing in Dynamic Languages
Directorate for Computer & Information Science & Engineering