Many public-domain (and a few proprietary) FFTs were benchmarked along with FFTW. There are a staggering number of FFT implementations floating around; hopefully, this benchmark will put an end to the confusion and allow most of the FFTs to slip quietly into oblivion.
In addition to benchmarking the performance, we also measured the numerical accuracy of each of the codes.
In order to keep the times comparable for different values of N, we graphed the time for one iteration divided by N log2 N. (Recall that, in principle, the FFT takes O(N log N) time to compute. Thus, we are plotting the "constant" factor in this relation.)
It should be noted that, due to flaws in modern architectures and compilers, there are uncertainties in the benchmark results that have nothing to do with the timing accuracy.
For both one and three dimensional transforms, we calculated the mean error vs. array size for all the FFTs. The results are only presented for one platform since they were similar on all the machines. (Actually, there were a few codes that did not provide inverse transform routines; the error for these routines was not computed.)
We would like to take this opportunity to thank the authors of the various FFTs, especially those who have placed their code in the public domain. Without their aid as a reference, their experiences to learn from, and the stimulation of their competition, FFTW would never have been developed.
First, the CWP prime factor algorithm implementation was sometimes faster for very large arrays (in one dimension). It should be noted that prime factor algorithms place strong restrictions on the size of the input array. In particular, CWP does not work for sizes that are powers of two, and thus all the results for CWP were measured on input arrays slightly bigger than the ones used for the other FFTs. In some sense, then, this is comparing apples and oranges. Nevertheless, in many applications people do not really care about the exact size of the problem, and therefore CWP becomes a valid alternative to FFTW. We hope to incorporate some sort of prime factor algorithm in a future release of FFTW. (Actually, FFTW v1.1 does use the prime-factor algorithm in the codelets for composite base cases such as 6, 12, and 15.)
Second, the Cray C90 SCILIB beat FFTW on that machine, but this is a special case (see the C90 results for more information.) We also benchmarked other vendor-optimized codes such as SUNPERF and ESSL; FFTW compares very favorably with these codes--see the charts for more details. ESSL was faster for 3d transforms of sizes that weren't powers of two, as it also apparently uses a prime factor algorithm.
The reader should be aware that not all of the FFTs listed in the graph legends will be visible in every plot. There are several reasons for this. First of all, some routines (fft2 and Beauregard especially) were so slow that they are usually completely off the scale of the graphs (which were designed for comparison of the fastest routines, not the slowest). Second, in the graphs for array sizes that are not powers of two, some routines do not appear because they do not support those transform sizes. Finally, the data points for FFTW_ESTIMATE are often on top of those for FFTW, obscuring the latter.
cc -O3 -qarch=pwrx -qtune=pwrx
.
cc dalign -fast -native -xO5
.
cc -O3 -h align
.
Go back to the FFTW home page.