# The Fastest Fourier Transform in the West

### Matteo Frigo

Vanu Inc. and MIT Laboratory for Computer Science

## Abstract

FFTW is a system for computing the discrete Fourier transform (DFT) of
both real and complex multidimensional data. In constrast to existing
DFT libraries that implement a fixed algorithm, FFTW automatically
adapts the computation to the underlying hardware so as to maximize
performance. FFTW is generally at least as fast as comparable
implementations of the DFT, including those that are tuned for a
specific machine.
FFTW is an extreme example of automatic program specialization.
Specifically, FFTW computes the DFT in three stages.

The first stage occurs before compilation time, when the ``genfft''
domain-specific compiler specializes a high-level specification of a
DFT algorithm into a set of ``codelets.'' A codelet is an optimized
computational kernel written in C that either computes the DFT of a
small input array or can be composed with other codelets to compute
the DFT of a larger input. genfft is written in Objective Caml, and
its partial-evaluation capabilities are powerful enough to derive DFT
algorithms for real data from a complex-data algorithm. In some
cases, genfft even derives algorithms that were not previously known.
genfft uses the theory of ``cache-oblivious algorithms'' [Frigo et
al., 1999] to produce C code for which a good register allocation
exists.

The second stage occurs during program initialization, after the
codelets have been compiled and linked together to form an executable.
In this stage, a ``planner'' accepts a description of the DFT problem
to be solved and produces a ``plan,'' which is a composition of
codelets that solves that particular problem. The planner performs a
dynamic-programming search over the space of allowable plans and
selects the fastest by explicit time measurements. This is the
machine-dependent step in which FFTW adapts itself to the hardware.

Finally, an ``executor'' accepts a plan and the data to be transformed
and computes the DFT of the data. A plan can be used to compute as
many DFTs as desired. In the current FFTW system, plans are Lisp-like
closures that can be invoked directly, but in older FFTW
implementations the executor was an explicit interpreter for a
special-purpose DFT language.

In this talk, I present the FFTW system and I argue that, because of
the unpredictability of current computer architectures, FFTW
represents a reasonable programming paradigm for high-performance
software.

Joint work with Steven G. Johnson of the MIT Department of Physics.

This work was supported in part by the Defense Advanced Research
Projects Agency (DARPA) under Grant F30602-97-1-0270 and by the
Special Research Program SFB F011 ``AURORA'' of the Austrian Science
Fund FWF.