Reality Signal Processor/CPU Pipeline
RSP pipeline
This page describes how the RSP CPU pipeline works, from a software point of view. It does not speculate about how the hardware works internally; it only describes the effects of the pipelines on code execution.
VU / SU pipelines
The CPU core contains two different pipelines that can run in parallel.
- VU (vector unit): this is the vector pipeline, which executes the special vector instructions (encoded as COP2 opcodes). The opcodes that run in the VU are identified by an opcode name that starts with the letter v. For instance, vadd and vmov are vector instructions that run in the VU. NOTE: vector loads/stores such as lqv or sqv are not considered vector instructions and do not run on the VU, as indicated by their opcode names, which do not start with v.
- SU (scalar unit): this is the scalar pipeline, which executes the standard MIPS 32-bit scalar instructions. For instance, lw, srl, suv are all instructions that run in the SU.
Pipeline stalls: write latency
Under normal conditions, every RSP opcode executes in exactly 1 clock cycle, but opcodes differ in latency. This means that while the opcode itself appears to run in 1 clock cycle, its destination register might not be available in time for the next instruction. In general:
- All instructions that write to a VU register run in 1 clock cycle but their destination register has a 4-cycle latency. This includes all VU opcodes (eg: vaddc) but also SU opcodes that write to a VU register, such as all load ops (eg: lqv) and mtc2.
- Most instructions that write to a SU register run in 1 clock cycle and have a 1-cycle latency. The following are the exceptions:
  - DMEM loads (lw, etc.) run in 1 cycle too but have a 3-cycle latency. Notice that stores do the same, but there is no way to cause a pipeline stall with stores, so we are not focusing on them.
  - mfc0, mfc2, cfc2 run in 1 cycle and have a 3-cycle latency.
For instance:
vsubc $v02, $v10, $v11 # Writes to $v02
vlt $v02, $v02, $v12 # Reads from $v02 => STALL (3 cycles)
Since vsubc has a 4-cycle latency, its destination register is available only on the 4th cycle after it. Given that the next opcode, vlt, tries to read from $v02, a stall is issued that lasts for 3 cycles. Basically the effect is similar to:
vsubc $v02, $v10, $v11 # Writes to $v02
nop
nop
nop
vlt $v02, $v02, $v12 # Reads from $v02 => NO STALL (it's on the 4th cycle after vsubc)
Other examples, with lw and with a vector load:
lw t0, 0(s0) # Writes to t0
sll t0, 2 # Reads from t0 => STALL (2 cycles)

lpv $v00, 0(s0) # Writes to $v00
vaddc $v00, $v01 # Reads from $v00 => STALL (3 cycles)
Pipeline stalls: branches
Branches run in 3 cycles (including the delay slot):
- One cycle for the branch instruction
- One cycle for the delay slot
- One cycle of delay ("pipeline bubble") for internally finalizing the branch.
For instance:
and t0, 1 # Cycle 0
bnez t0, label # Cycle 1
vmulf $v00, $v01, $v02 # Cycle 2
label:
addiu a0, 1 # Cycle 4 (cycle 3 was the bubble)
If the delay slot itself stalls, the bubble comes after the stall:
lw a0, (s0) # Cycle 0
bnez t0, label # Cycle 1
addiu a0, 8 # Cycle 2-3 (1 stall because a0 wasn't ready yet)
label:
addiu a0, 1 # Cycle 5 (cycle 4 was the bubble)
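The cycle numbering in the two examples above can be reproduced with a few additions. This is a minimal sketch under the same simplifying assumptions as the examples (no dual-issue, no stall on the branch instruction itself); the function name `target_cycle` is made up for illustration.

```python
def target_cycle(branch_cycle: int, delay_slot_stall: int = 0) -> int:
    """Cycle on which the first instruction at the branch target runs.

    branch_cycle:     cycle on which the branch instruction itself runs
    delay_slot_stall: extra cycles the delay slot spends stalled
    """
    delay_slot_start = branch_cycle + 1
    # The delay slot occupies one cycle plus any stall cycles it incurs.
    delay_slot_end = delay_slot_start + delay_slot_stall
    # One bubble cycle follows the delay slot before the target runs.
    bubble_cycle = delay_slot_end + 1
    return bubble_cycle + 1
```

With the branch on cycle 1, `target_cycle(1)` corresponds to the first example and `target_cycle(1, delay_slot_stall=1)` to the second, where the delay slot waits one cycle for the lw result.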
Pipeline stalls: stores after loads
A stall is generated any time a memory store executes exactly 2 cycles after a memory load, irrespective of which instructions they are or which registers they affect. For instance:
lw t0, 0(s0)
nop
sqv $v04, 0(s1) # STALL: write happening two cycles after load
mfc0, mtc0, mfc2, mtc2, cfc2, ctc2 do not access memory, but they also cause the exact same stall, and they even count as both load and store, irrespective of the behavior of the actual instruction. For instance:
lw t0, 0(s0)
nop
mtc0 v0, COP0_SP_STATUS # STALL: mtc0 happening two cycles after load

lw t0, 0(s0)
nop
mfc0 v0, COP0_SP_STATUS # STALL: mfc0 happening two cycles after load (even if "mfc0" seems itself a load...)

mtc2 t0, $v04.e0
nop
cfc2 v0, COP2_VCC # STALL: cfc2 happening two cycles after mtc2
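The rule above is easy to check mechanically on a straight-line sequence. The following Python sketch flags the instructions that would hit the store-after-load stall. The opcode sets are deliberately short and not exhaustive, the helper name `store_after_load_hits` is hypothetical, and the model assumes one instruction per cycle: it ignores dual-issue and does not account for how an earlier stall shifts the timing of later instructions.

```python
# Small, non-exhaustive opcode sets used only for this illustration.
LOADS = {"lw", "lqv", "lpv"}
STORES = {"sw", "sqv", "suv"}
# Coprocessor moves count as both a load and a store for this hazard.
BOTH = {"mfc0", "mtc0", "mfc2", "mtc2", "cfc2", "ctc2"}


def store_after_load_hits(mnemonics):
    """Return the indices of instructions that run exactly 2 slots after a load
    and are themselves treated as a store."""
    hits = []
    for i in range(2, len(mnemonics)):
        acts_as_store = mnemonics[i] in STORES or mnemonics[i] in BOTH
        acts_as_load = mnemonics[i - 2] in LOADS or mnemonics[i - 2] in BOTH
        if acts_as_store and acts_as_load:
            hits.append(i)
    return hits
```

For example, `store_after_load_hits(["lw", "nop", "sqv"])` reports index 2, while moving the store one slot closer or further away (`["lw", "sqv"]` or `["lw", "nop", "nop", "sqv"]`) reports nothing.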
Dual-issue
Given that VU and SU mostly run in parallel, the RSP is able to run two instructions in just 1 clock cycle. This happens when a SU instruction and a VU instruction are next to each other (though not always, see below):
vmudh $v02, $v04, $v06.e7
add t0, t1 # DUAL-ISSUE (SU after VU)

add t0, t1
vmudh $v02, $v04, $v06.e7 # DUAL-ISSUE (VU after SU)
There are a few cases where it is not possible to dual-issue:
- After a branch: the first instruction on the target of a branch can dual-issue only if it is 8-byte aligned. When writing a hot loop, make sure the loop start (target of the end-loop branch) is 8-byte aligned so that you don't lose the dual-issue opportunity on the first instruction.
- The delay slot of a branch never dual-issues (whether the branch is taken or not).
- If the first instruction of a pair writes to a vector register that is either read or written by the second instruction, the pair will not dual-issue. Since we are discussing vector registers here, this applies when the SU instruction is one of the few that access vector registers (that is, vector loads, mfc2, mtc2).
vand $v04, $v30, $v31
lqv $v04, 0(s0) # NO DUAL ISSUE with previous op: writes to $v04 which was also written by vand

vand $v04, $v30, $v31
mfc2 t0, $v04.e4 # NO DUAL ISSUE with previous op: reads from $v04 which was written by vand

mfc2 t0, $v04.e4
vand $v04, $v30, $v31 # DUAL ISSUE with previous op: the previous instruction only reads from $v04
- Similarly, CFC2/CTC2 (SU instructions) can prevent dual-issue if they access a control register that is read/written by the instruction they dual-issue with. In this case, the RSP is a bit overbroad because VU instructions that access the control register are counted as both reading and writing it, even though they only read it.
ctc2 t0, VCO
vadd $v00, $v01, $v02 # NO DUAL ISSUE: vadd reads from VCO, which was written by the ctc2

vmrg $v00, $v01, $v02 # NOTE: vmrg only reads from VCC
ctc2 t0, VCC # NO DUAL-ISSUE: when pairing with ctc2, first instruction is treated as if it was also writing to VCC
Dual-issue: hardware bug with single-lane instructions
There is a hardware bug in the RSP that, in special cases, prevents dual-issuing when a single-lane instruction is the second instruction of the pair. The single-lane instructions affected by this bug are: VRCP, VRCPL, VRCPH, VMOV, VRSQ, VRSQL, VRSQH, VNOP.
The bug is related to the interpretation of the de field of the opcode, which is treated as a register number (rather than an element modifier) for the purpose of checking read/write conflicts. For instance:
mtc2 t0, $v04.e2
vrcp $v01.e4, $v02.e7 # NO DUAL ISSUE: hardware bug: "e4" wrongly treated as a reference to "$v04"
With the standard element syntax (.e0 - .e7), it is possible to create conflicts only with registers $v00 - $v07. At the hardware level, though, it is possible to specify a full 5-bit index number (from 0 to 31), which is exposed via the old SGI syntax. For instance, these instructions use different opcode encodings but produce exactly the same result:
vrcp $v01[e4], $v02[e7]
vrcp $v01[e12], $v02[e7] # Same as $v01.e4
vrcp $v01[e20], $v02[e7] # Same as $v01.e4
vrcp $v01[e28], $v02[e7] # Same as $v01.e4
This can be used to sidestep the bug:
mtc2 t0, $v04.e2
vrcp $v01[e12], $v02[e7] # DUAL ISSUE: will still write to $v01.e4, but the hardware bug sees it as a fake reference to $v12
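The two readings of the de field can be put side by side in a short Python sketch. It is inferred purely from the examples above (e12, e20 and e28 all behaving like e4, and e12 being seen as $v12): the element actually written is taken to be the low 3 bits of de, while the conflict check is taken to see all 5 bits as a register number. The function names are hypothetical, and the sketch only covers the false conflict, not the genuine register conflicts described in the previous section.

```python
def written_element(de: int) -> int:
    """Element actually written by a single-lane instruction (low 3 bits of de)."""
    return de & 0x7


def register_seen_by_conflict_check(de: int) -> int:
    """Register number the buggy conflict check reads out of the 5-bit de field."""
    return de & 0x1F


def triggers_false_conflict(de: int, su_vreg: int) -> bool:
    """True if de aliases the vector register accessed by the paired SU instruction."""
    return register_seen_by_conflict_check(de) == su_vreg
```

With the SU instruction touching $v04, `triggers_false_conflict(4, 4)` matches the failing pair, while `triggers_false_conflict(12, 4)` matches the workaround; `written_element` returns the same element for 4, 12, 20 and 28.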