Figure 3.1. Cell/BE architecture.
or three-dimensional calculations. However, there is a big difference between an
SPU and other general-purpose coprocessors: each SPU has its own memory space
and can run a full C/C++ program. Moreover, 90% of the computational power of
Cell/BE lies in the SPUs, so optimizing for Cell/BE essentially means optimizing
for the SPUs. (See Figure 3.1.)
Each SPU has its own local memory, called the local storage. Before a program
can run on an SPU, the program and its data must be transferred into the local
storage. Because the SPUs and the PPU don't share a memory space, an SPU can't
directly access the main memory. Whenever it has to access the main memory, it
uses its direct memory access (DMA) unit to transfer data to and from the main
memory. An SPU can access its local storage much faster than the main memory,
but the local storage is limited to 256 KB per SPU. Moreover, the program, data,
and stack all consume local storage, so the amount of available memory is very
limited. Whenever we write an SPU program, we have to take the size of the
available memory into account. To achieve the highest performance from an SPU,
we must always use the local storage effectively. Fortunately, an SPU can run
DMA transfers and computation simultaneously, so we can hide the latency of
DMA transfers with double buffering.
3.2.1 Basic Approach for Cell/BE Optimization
A typical rigid-body simulation pipeline has four stages, and rigid-body data are
processed through these stages in order. Because there are dependencies between
the stages, the stages themselves can't be processed simultaneously. To parallelize
the whole pipeline, we instead parallelize each stage internally, using multiple
SPUs (see Figure 3.2).
Each stage is divided into tasks. A task is the basic unit of computation
processed in parallel on an SPU. If a task becomes too small, the cost of starting
it and transferring its data may exceed the cost of the computation itself; in the
worst case, the parallelized version becomes slower than the original. It is
therefore important to balance task granularity carefully.