The following is from the README.Cell file included in FFTW. We also provide some benchmarks from an IBM Cell Blade and a PlayStation 3.
Cell support was removed in FFTW version 3.3 in 2011, primarily because we lack a machine to test on, combined with a perceived lack of user interest for the last few years. Users who wish to employ FFTW on the Cell can continue to use version 3.2.2.
The Cell code in FFTW was written and graciously donated to the FFTW project by the IBM Austin Research Laboratory. We are grateful to Pat Bohrer and Lorraine Herger of IBM for this generous contribution.
Cell consists of one PowerPC core ("PPE") and a number of Synergistic Processing Elements ("SPEs") to which the PPE can delegate computation. The IBM QS20 Cell blade offers 8 SPEs per Cell chip. The Sony PlayStation 3 contains 6 usable SPEs.
This version of FFTW fully utilizes the SPEs for one- and multi-dimensional complex FFTs of sizes that can be factored into small primes, both in single and double precision. Transforms of real data use SPEs only partially at this time. If FFTW cannot use the SPEs, it falls back to a slower computation on the PPE.
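The phrase "sizes that can be factored into small primes" can be illustrated with a short C helper. The text above does not say which primes qualify, so the set {2, 3, 5, 7} used below is purely an assumption for illustration, and the function name is hypothetical; it is not part of FFTW.

```c
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: the exact set of "small primes" is not stated in this
 * document; {2, 3, 5, 7} is chosen here only for illustration. */
static const int small_primes[] = {2, 3, 5, 7};

/* Hypothetical helper (not an FFTW function): returns true if n is a
 * product of the primes listed above, i.e. nothing is left over after
 * dividing out every listed prime as often as possible. */
bool factors_into_small_primes(int n)
{
    if (n < 1)
        return false;

    for (size_t i = 0; i < sizeof small_primes / sizeof small_primes[0]; ++i) {
        while (n % small_primes[i] == 0)
            n /= small_primes[i];
    }
    return n == 1;
}
```

Under this assumed prime set, a size such as 1000 = 2^3 * 5^3 passes the check, while 1001 = 7 * 11 * 13 does not because of the factors 11 and 13.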
This library is meant to use the SPEs transparently without user intervention. However, certain caveats apply, which are discussed later in this document.
To enable support for Cell in double precision:
    configure --enable-cell
    make
    make install
In single precision:
    configure --enable-cell --enable-single
    make
    make install
In addition, the PPE supports the Altivec (or VMX) instruction set in single precision. (Altivec is Apple/Freescale terminology, VMX is IBM terminology for the same thing.) You can enable support for Altivec with the "--enable-altivec" flag (single precision only).
The software compiles with the Cell SDK 2.0, and probably with earlier ones as well.
By default, FFTW tries to grab all available SPEs for itself. To use a different number, call fftw_cell_set_nspe(n), where "n" is the number of desired SPEs. Expect this interface to go away once we figure out how to make FFTW play nicely with other Cell software.
In particular, if you try to link both the single- and double-precision versions of FFTW in the same program (which you can do), they will both try to grab all SPEs and the second one will hang.
FFTW_ESTIMATE mode may produce seriously suboptimal plans, and it becomes particularly confused if you enable both the SPEs and Altivec. If you care about performance, please use FFTW_PATIENT until we figure out a more reliable performance model.
The SPEs are fully IEEE-754 compliant in double precision. In single precision, they only implement round-towards-zero as opposed to the standard round-to-even mode. (The PPE is fully IEEE-754 compliant like all other PowerPC implementations.) Because of the rounding mode, FFTW is less accurate when running on the SPEs than on the PPE. The accuracy loss is hard to quantify in general, but as a rough guideline, the L2 norm of the relative roundoff error for random inputs is 4 to 8 times larger than for the corresponding calculation in round-to-even arithmetic. In other words, expect to lose 2 to 3 bits of accuracy.
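The step from "4 to 8 times larger" to "2 to 3 bits" is a base-2 logarithm. The sketch below assumes the relative L2 error is defined in the usual way, as the norm of the difference between the computed output and the exact output divided by the norm of the exact output; the symbols are introduced here for illustration and do not appear in the original text. It requires the amsmath package.

```latex
\[
  \varepsilon = \frac{\lVert \hat{y} - y \rVert_2}{\lVert y \rVert_2},
  \qquad
  \text{bits lost} \approx
    \log_2 \frac{\varepsilon_{\mathrm{SPE}}}{\varepsilon_{\mathrm{PPE}}},
  \qquad
  \log_2 4 = 2, \quad \log_2 8 = 3 .
\]
```

Here \hat{y} is the computed transform, y the exact one, and the subscripts mark the single-precision error on the SPEs (round-towards-zero) and on the PPE (round-to-even).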
FFTW currently does not use any algorithm that degrades accuracy to gain performance on the SPE. One implication of this choice is that large 1D transforms run slower than they would if we were willing to sacrifice another bit or so of accuracy.
These benchmarks show the results of running benchFFT on an IBM Cell Blade and a PlayStation 3. Note that, of the programs benchmarked, only FFTW uses the Cell SPEs.