
This page outlines our benchmarking methodology. For complete details, see the source code, available from the benchFFT home page.

Note that we currently benchmark only single-processor performance, even on multi-processor systems.

Complex data can be stored either in interleaved format (adjacent real/imaginary pairs) or in separate real and imaginary arrays. Routines that accept a single format were benchmarked with that format; routines that accept either format were benchmarked with interleaved data, which seems to be the most commonly used.

In interpreting the benchmark results, please bear in mind that routines using different formats are not strictly comparable. Consequently, please interpret our plots only as a rough description of the relative merits of different codes: a faster code is of no use if it produces output in a format that you do not want.

For each machine and compiler, we hand-pick "good" compiler options that seem to work well for optimized codes, which are then used for all codes, except for those that come with their own compilation scripts to pick their own optimization flags. Descriptions of the chosen compilers and flags are included in the results.

As far as possible, we use the unmodified, original codes, except where very minor patches were required for successful compilation. Rigorous correctness tests are performed to make sure that we are calling the FFT properly (and that it is bug-free).

Our FFT timing measurement is intended to reflect the common case where many FFTs of the same size, indeed of the same array, are required. Thus, we break the measurement into two parts:

First, we call any separate **initialization/setup** routines
provided by the code. This step may include calling the code's FFT
once, if it performs initialization on the first call. This setup
time is measured separately from the FFT performance below, but only
as a rough indicator; no attempt is made to perform repeated
measurements or to make our initialization preparations as efficient
as possible.

Second, we measure the **FFT performance** by performing
repeated FFTs of the same zero-initialized array.

The input array is initialized to zero to prevent divergences (and the resulting floating-point exceptions) from repeated FFTs of the same array. No FFT code that we know of, nor any floating-point hardware in the benchmark, has a speed that depends on the input data (except via floating-point exceptions).

The timing procedure consists of two loops. First, we compute
enough repeated FFTs so that the total time is sufficient for accurate
timing, and divide by the number of iterations to obtain the average
time. Second, we repeat this averaging process eight times, and
report the *minimum* average time (to avoid fluctuations due to
system interrupts, cache priming, etcetera).
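The two-loop procedure above can be sketched as follows. This is an illustrative simplification, not code from the benchmark (which is written in C and uses `gettimeofday` with an lmbench-derived calibration); all function names here are hypothetical:

```python
import time

def time_fft(fft, setup, n_averages=8, min_time=0.01):
    """Sketch of the two-loop timing procedure.

    `setup` stands for the code's separate initialization routine
    (timed elsewhere); `fft` performs one transform of the same array.
    `min_time` plays the role of the calibrated minimum measurement time.
    """
    setup()
    averages = []
    for _ in range(n_averages):
        # Inner loop: repeat the FFT until the total elapsed time is
        # long enough for accurate timing, then divide by the
        # iteration count to get an average time per FFT.
        iters = 1
        while True:
            start = time.perf_counter()
            for _ in range(iters):
                fft()
            elapsed = time.perf_counter() - start
            if elapsed >= min_time:
                averages.append(elapsed / iters)
                break
            iters *= 2
    # Outer loop result: report the minimum of the averages, to avoid
    # fluctuations due to system interrupts, cache priming, etc.
    return min(averages)
```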

The timer routine is `gettimeofday`, and the minimum time for accurate measurement is computed using a calibration routine from lmbench, a free Unix benchmark by Larry McVoy and Carl Staelin, who graciously permitted us to use their routine under the LGPL license.

To report FFT performance, we plot the "**mflops**" of each FFT,
which is a scaled version of the speed, defined by:

mflops = 5 N log_{2}(N) / (time for one FFT in microseconds) for complex transforms, and

mflops = 2.5 N log_{2}(N) / (time for one FFT in microseconds) for real transforms,

where N is the number of data points (the product of the FFT dimensions). This is not an actual flop count; it is simply a convenient scaling, based on the fact that the radix-2 Cooley-Tukey algorithm asymptotically requires 5 N log_{2}(N) floating-point operations for a complex transform.
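This scaling is straightforward to compute; here is a minimal sketch (the function name is our own, not part of the benchmark):

```python
import math

def mflops(n, time_us, real=False):
    """Scaled "mflops" measure for one FFT.

    n       -- number of data points (product of the FFT dimensions)
    time_us -- time for one FFT, in microseconds
    real    -- True for a real transform (factor 2.5 instead of 5)
    """
    scale = 2.5 if real else 5.0
    return scale * n * math.log2(n) / time_us
```

For example, a size-1024 complex transform taking 1 microsecond would score 5 * 1024 * 10 = 51200 mflops.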

In order to keep the graphs readable, however, **we only plot the
"fastest" results**. That is, when a particular code has several
variants (different radices, calling options, etcetera), we only plot
the variant that is "on average" the fastest (with a few exceptions, for
codes of particular interest or for variants with radically different
performance characteristics). We also only plot either the forward or
the backward transform speed, whichever is faster "on average".

To quantify the "average" speed of an FFT routine, and also to
reorder the plot legend for improved readability, we define a **plot
rank** for each FFT as follows. First, for each transform size in a
plot, we compute a *rank* = (mflops for FFT) / (mflops for
fastest FFT for that size). Then, the *plot rank* of a given FFT
is defined as the median of its ranks for all sizes in the plot. The
plot rank is only a convenient heuristic to produce useful plots, and
it should not be interpreted as an absolute measure of performance.
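The plot-rank heuristic above can be sketched in a few lines. This is an illustrative reconstruction from the definition, not the benchmark's own code, and it assumes the simplified case where every code is measured at every size:

```python
from statistics import median

def plot_ranks(speeds):
    """Compute the plot rank of each FFT code.

    speeds -- {code_name: {size: mflops}}; assumed complete, i.e.
              every code has a measurement at every size.
    """
    sizes = next(iter(speeds.values())).keys()
    # Fastest mflops at each transform size.
    best = {s: max(code[s] for code in speeds.values()) for s in sizes}
    # Rank at each size = mflops / (mflops of fastest code at that size);
    # plot rank = median of those ranks over all sizes.
    return {name: median(code[s] / best[s] for s in sizes)
            for name, code in speeds.items()}
```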

In the raw benchmark data output, the speed for all routines, for both forward and backward transforms, is collected in the file *host*.speed in the space-delimited format:

name-of-code transform-type transform-size mflops time setup-time

where the times are in seconds.

*transform-type* is a four-character string consisting of precision (double/single = `d`/`s`), type (complex/real = `c`/`r`), in-place/out-of-place (= `i`/`o`), and forward/backward (= `f`/`b`). For example, *transform-type* = `dcif` denotes a double-precision in-place forward transform of complex data.
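A line of this format can be decoded as follows; this is a hypothetical helper written from the description above, not a tool shipped with the benchmark:

```python
def parse_speed_line(line):
    """Parse one space-delimited line of a host.speed file.

    Times are in seconds; the size is kept as a string, since
    multi-dimensional sizes need not be a single integer.
    """
    name, ttype, size, mflops, t, setup_t = line.split()
    # The four-character transform-type: precision, data type,
    # placement, direction.
    precision, dtype, place, direction = ttype
    return {
        "name": name,
        "precision": {"d": "double", "s": "single"}[precision],
        "type": {"c": "complex", "r": "real"}[dtype],
        "in_place": place == "i",
        "forward": direction == "f",
        "size": size,
        "mflops": float(mflops),
        "time": float(t),
        "setup_time": float(setup_t),
    }
```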
