We present FGen, a program generator for high performance convolution operations (finite-impulse-response filters). The generator uses an internal mathematical DSL to enable structural optimization at a high level of abstraction. We use FGen as a testbed to demonstrate how to provide modular and extensible support for modern SIMD vector architectures in a DSL-based generator. Specifically, we show how to combine staging and generic programming with type classes to abstract over both the data type (real or complex) and the target architecture (e.g., SSE or AVX) when mapping DSL expressions to C code with explicit vector intrinsics. Benchmarks shows that the generated code is highly competitive with commercial libraries.
Unlocking the Future of AI Agents with Large Language Models
Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters
1. Abstracting Vector Architectures in Library
Generators: Case Study Convolution Filters
Alen Stojanov
ETH Zurich
Georg Ofenbeck
ETH Zurich
Tiark Rompf
EPFL, Oracle Labs
Markus Püschel
ETH Zurich
Department of Computer Science,
ETH Zurich, Switzerland
2. Generators for High Performance Libraries
3984 C functions
1M lines of code
Computer generated
TRANSFORMS
DFT (fwd+inv), RDFT (fwd+inv)
DCT2, DCT3, DCT4, DHT, WHT
SPIRAL: Code Generation for DSP Transforms, M. Püschel, J. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic,
Y. Voronenko, K. Chen, R. Johnson, N. Rizzolo, Proceedings of the IEEE special issue on "Program Generation, Optimization, and Adaptation,"
Vol. 93, No. 2, 2005, pp. 232-275
1
3. DATA TYPE
Real / Complex Numbers
Split / Interleaved Arrays
Single / Double Precision
ISA
Scalar, SSE,
LRB, AVX
CODE STYLE
Looped / Unrolled
123
1.23
• Math knowledge: DFT, RDFT,
DHT, WHT, etc.
• Algorithms represented in
high level array-style DSLs
• Low Level Optimizations
TRANSFORMS
DOMAIN
4. NEW DOMAIN
DATA TYPE
Real / Complex Numbers
Split / Interleaved Arrays
Single / Double Precision
ISA
Scalar, SSE,
LRB, AVX
CODE STYLE
Looped / Unrolled
123
1.23
6. Staging
ABSTRACTION
• lower cost of domain
implementation
• reuse at many levels
ISA
Code
Style
Data
Type
GOAL Lightweight
Modular
Staging1
1Lightweight modular staging: a pragmatic approach to runtime code
generation and compiled DSLs. T. Rompf, M. Odersky, GPCE 2010
9. Staged
def add (!
a: Float, !
b: Float!
): Float = {!
a + b!
}!
def add (!
a: Rep[Float], !
b: Rep[Float]!
): Rep[Float] = { !
a + b!
}!
Non-Staged
Float vs. Rep[Float]
LMS uses Scala Types
to control staging
T vs. Rep[T] - LMS
overloads operations on a
standard type [T] with a
staged version type Rep[T]
Staged and non-staged
code can be abstracted
using type parameters
def add [T] (!
a: T, b: T!
): T = { !
a + b!
}!
add [Rep[Float]]!
type NoRep[T] = T!
add[NoRep[Float]]!
10. Building generators with staging
def add (!
a: Rep[Float], !
b: Rep[Float]!
): Rep[Float] = { !
a + b!
}!
float add (!
float a, !
float b!
) = { !
return a - b;!
}!
Staging
AST Rewrites:
• Domain specific
optimizations
• Low level
optimizations
Unparsing
to C code
def add (!
a: Rep[Float], !
b: Rep[Float]!
): Rep[Float] = { !
a + b * (-1)!
}!
a
b-1
a b
11. Abstracting Code Styles with Staging1
for (i <- 0 to n: Rep[Int]) { f(i) }!
for (i <- 0 to n: NoRep[Int]) { f(i) }!
n f(i)
for
ArrayApply(a, 1)!
ArrayApply(a, 3)!
val (x1, x2) = (a(1), a(3))!
val x3 = x1 + x2!
val a: Array[Rep[T]]!val a: Rep[Array[T]]!
a
x1
1
x2
3
a
x1
x2
a(0)
a(1)
a(2)
a(3)
a(4)
Looped
Code
Unrolled
Code
Arrays of Staged Elements
(i.e. array scalarization)Staged Arrays
1Spiral in Scala: towards the systematic construction of generators for performance libraries.
G. Ofenbeck, T. Rompf, A. Stojanov M. Odersky, M. Püschel. GPCE 2013
f(0)
f(1)
f(2) f(n)
…f(3)
12. Scala Type Classes
used to encapsulate
implementation
abstract class NumericOps[T] {!
def plus (x: T, y: T) : T!
def minus (x: T, y: T) : T!
def times (x: T, y: T) : T!
}!
class SISDOps[T] extends NumericOps[Rep[SISD[[T]]] !
{!
type E = Rep[SISD[[T]]!
def plus (x: E, y: E) = Plus [E](x, y)!
def minus(x: E, y: E) = Minus[E](x, y)!
def times(x: E, y: E) = Times[E](x, y)!
}!
SISD (staged)
class SIMDOps[T] extends NumericOps[Rep[SIMD[[T]]]!
{ !
type E = Rep[SIMD[[T]]!
def plus (x: E, y: E) = VPlus [E](x, y)!
def minus(x: E, y: E) = VMinus[E](x, y)!
def times(x: E, y: E) = VTimes[E](x, y)!
}!
SIMD (staged, ISA independent)
abstract class ArrayOps[…] {!
def alloc (…): …!
def apply (…): …!
def update (…): …!
}!
19. Multi-level Tiling Σ-LL
Instruction Set Architecture
Loop Optimizations Σ-LL
Memory Layout Specialisation
Data Type Instantiation
ISA Specialization
Data Type & Memory Layout
Code Level Optimization I-IR
Optimised C code
Convolution Expression
PerformanceFeedbackLoop
FGen
Program generator for high
performance convolution
operations, based on DSLs
↓
omitted
omitted
26. Summary
• Generic programming coupled
with staging used to build
program generators
• Single codebase for different code
styles, ISA and data abstraction
• Applicable to multiple domains
• Comparable results to commercial
libraries