Go back to the FFTW home page.

Parallel Versions of FFTW

Starting with FFTW 1.1, we include parallel Fourier transform subroutines in the FFTW distribution. Parallel computation is very important to many users, but few (if any) parallel FFT codes are publicly available. We hope that we can begin to fill this gap.

We are currently experimenting with several parallel versions of FFTW. These have not yet been optimized nearly as fully as FFTW itself, nor have they been benchmarked against other parallel FFT software (mainly because parallel FFT codes are few and far between). They are, however, a start in the right direction.

Parallel versions of FFTW exist for:

Cilk (shared memory)
Threads (shared memory)
MPI (distributed memory)

Benchmarks of the parallel FFTs

We have done a small amount of benchmarking of the parallel versions of FFTW. Unfortunately, there isn't much to benchmark them against except each other--we have had difficulty finding other publicly available parallel FFT software. On the Cray T3D, however, we were able to compare Cray's PCCFFT3D routine to FFTW.

Benchmark results are available for the following machines:

How the parallel FFTW transforms work

The following is a very brief overview of the methods we currently use to parallelize FFTW.

One-dimensional Parallel FFT

The Cooley-Tukey algorithm that FFTW uses to perform one-dimensional transforms works by breaking a problem of size N=N1N2 into N2 problems of size N1 and N1 problems of size N2. The way in which we parallelize this transform, then, is simply to divide these sub-problems equally among different threads. In Cilk, this amounts to little more than adding spawn keywords in front of subroutine calls (thanks to FFTW's explicitly recursive structure). Currently, we keep the data in the same order as we do for the uniprocessor code; in the future, we may reorder the data at each phase of the computation to achieve better segregation of the data between processors.

Multi-dimensional Parallel FFT (shared memory)

Multi-dimensional transforms consist of many one-dimensional transforms along the rows, columns, etcetera of a matrix. For the Cilk and threads versions of FFTW, we simply divide these one-dimensional transforms equally between the processors. Again, we leave the data in the same order as it is in the uniprocessor case. Unfortunately, this means that the arrays being transformed by different processors are interleaved in memory, resulting in more memory contention than is desirable. We are investigating ways to alleviate this problem.

Multi-dimensional Parallel FFT (distributed memory)

The MPI code works differently, since it is designed for distributed memory systems--here, each processor has its own memory or address space in which it holds a portion of the data. The data on a processor is inaccessible to other processors except by explicit message passing. In the MPI version of FFTW, we assume that a multi-dimensional array is distributed across the rows (the first dimension) of the data. To perform the FFT of this data, each processor first transforms all the dimensions of the data that are completely local to it (e.g. the rows). Then, the processors have to perform a transpose of the data in order to get the remaining dimension (the columns) local to a processor. This dimension is then Fourier transformed, and the data is (optionally) transposed back to its original order.

The hardest part of the multi-dimensional, distributed FFT is the transpose. A transpose is one of those operations that sounds simple, but is very tricky to implement in practice. The real subtlety comes in when you want to do a transpose in place and with minimal communications. Essentially, the way we do this is to do an in-place transpose of the data local to each processor (itself a non-trivial problem), exchange blocks of data with every other processor, and then complete the transpose by a further (precomputed) permutation on the data. All of this involves an appalling amount of bookkeeping, especially when the array dimensions aren't divisible by the number of processors. The communications phase is an example of a complete exchange: an operation in which every processor sends a different message to every other processor. (This is probably the worst thing that you can do to a memory architecture.)

The whole process would be simpler, and possibly faster, if the transpose weren't required to be in-place. We felt, however, that if a problem is large enough to require a distributed memory multiprocessor, it is probably large enough that the capability to perform a transform in-place is essential.
