Go back to the Parallel FFTW page.

Parallel FFTW on a Cray T3D

The T3D is a distributed-memory machine, so we could only run the MPI code on it. As in the Sun benchmark, we measured the speedup of the MPI code relative to uniprocessor FFTW for three-dimensional transforms. The speedup is plotted as a function of transform size for 1, 2, 4, 8, and 16 processors.
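
The plotted quantity can be written down as a minimal sketch, assuming speedup is defined in the usual way as the uniprocessor time divided by the parallel time for the same transform size. The function name `speedup` and the timings used below are hypothetical placeholders for illustration, not measurements from this benchmark.

```python
def speedup(t_uniprocessor, t_parallel):
    """Speedup of a parallel run relative to the uniprocessor run.

    Both arguments are wall-clock times in the same unit for the same
    transform size. Values above 1 mean the parallel code is faster.
    """
    if t_uniprocessor <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_uniprocessor / t_parallel


# Placeholder timings (NOT measured data): 8.0 s serial, 2.0 s parallel.
example = speedup(8.0, 2.0)
print(example)
```

A value equal to the processor count would be ideal scaling; a value below 1 means the parallel code is slower than uniprocessor FFTW at that size.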

Two versions of the MPI FFTW are plotted. The first returns the output data in the same format as the input data. The second returns the output data in transposed order, which is faster because it saves the cost of an extra transpose.
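
The difference between the two output formats can be illustrated with a small serial sketch. It assumes a slab-style organization in which the array is split along its first dimension, local transforms are done within each slab, and a transpose of the first two dimensions makes the remaining dimension local. That organization is an assumption for illustration, not a description of the MPI FFTW internals. The helper names (`slab_dft3`, `swap01`, and so on) are hypothetical, and a naive DFT stands in for a real FFT. In an actual MPI run, each `swap01` would be a communication step between processors.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT of a 1D sequence (stand-in for a real FFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft_last_axis(a):
    """Transform every innermost row of a 3D nested list."""
    return [[dft(row) for row in plane] for plane in a]


def swap12(a):
    """Swap the last two axes: out[i][k][j] = a[i][j][k]."""
    return [[list(col) for col in zip(*plane)] for plane in a]


def swap01(a):
    """Swap the first two axes: out[j][i][k] = a[i][j][k].

    In a distributed run this is the expensive global transpose.
    """
    n0, n1 = len(a), len(a[0])
    return [[a[i][j] for i in range(n0)] for j in range(n1)]


def dft_axis1(a):
    """Transform along the middle axis by moving it to the innermost position."""
    return swap12(dft_last_axis(swap12(a)))


def slab_dft3(a, transposed_output=False):
    """3D DFT organized as: local 2D transforms, transpose, local 1D transforms."""
    a = dft_axis1(dft_last_axis(a))  # 2D transform inside each slab a[i]
    a = swap01(a)                    # first transpose: shape n1 x n0 x n2
    a = dft_axis1(a)                 # transform along the original first axis
    if transposed_output:
        return a                     # skip the second transpose
    return swap01(a)                 # second transpose restores n0 x n1 x n2


def direct_dft3(a):
    """Reference 3D DFT computed straight from the definition."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])

    def w(j, k, n):
        return cmath.exp(-2j * cmath.pi * j * k / n)

    return [[[sum(a[j0][j1][j2] * w(j0, k0, n0) * w(j1, k1, n1) * w(j2, k2, n2)
                  for j0 in range(n0) for j1 in range(n1) for j2 in range(n2))
              for k2 in range(n2)] for k1 in range(n1)] for k0 in range(n0)]


# Small 2 x 3 x 4 input with arbitrary values.
data = [[[complex(i + 2 * j - k, i * k - j) for k in range(4)]
         for j in range(3)] for i in range(2)]
normal = slab_dft3(data)
transposed = slab_dft3(data, transposed_output=True)
# transposed[j][i][k] holds the same value as normal[i][j][k].
print(len(transposed), len(transposed[0]), len(transposed[0][0]))
```

In this sketch the transposed-order variant simply omits the final `swap01`, so a caller that can work with the first two dimensions exchanged avoids one of the two global transposes.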

We also had access to a Cray-provided, optimized FFT for the T3D. Its speed is shown at the bottom of the page for comparison. Note that the Cray transform is out-of-place, unlike the MPI FFTW transform. The performance of FFTW relative to the Cray FFT is disappointing, but we are working on a faster version.

MPI FFTW, 3D Transforms

Cray 3D FFT (PCCFFT3D) (out-of-place)

Plotted value is speedup relative to uniprocessor FFTW.