Reality Signal Processor/CPU Core

Scalar unit (SU)
The scalar unit is the half of the RSP core that is similar to a standard MIPS R4000 32-bit CPU. It has 32 32-bit registers (conventionally called $0-$31) and implements most standard opcodes. This page does not describe the whole scalar unit, as standard MIPS documentation suffices; instead, it highlights the main differences.

Missing opcodes
The following opcodes are not implemented by the RSP:


 * Multiplication and division. The RSP does not have a scalar multiplication unit, so there is no MULT, MULTU, DIV, DIVU, MFHI, MFLO, MTHI, MTLO.
 * 64-bit instructions. The RSP only has 32-bit scalar registers in the SU, so there are no 64-bit opcodes (the ones starting with D, such as DADDIU, DSRL, etc.) nor 64-bit memory accesses such as LD, SD, LDL, SDL.
 * Opcodes for misaligned memory accesses. All memory accesses to DMEM can be correctly performed at misaligned addresses using the standard opcodes such as LW / SW or LH / LHU / SH, so there is no LWL, LWR, SWL, SWR.
 * Traps and exceptions. The RSP does not implement any form of interrupt or exception handling, so there is no SYSCALL nor trap instructions (TGE, TLT, etc.). BREAK is available, but it has a special behavior (see below).
 * Likely branches. The "likely" variant of each branch is not supported. The missing opcodes are the ones ending with L (such as BEQL, BLEZL, etc.).

Memory access
The RSP is a Harvard architecture: all opcodes are fetched from IMEM (4 KiB), and all data accesses go to DMEM (4 KiB).

The PC register is 12-bit; all higher address bits in branch / call instructions are thus ignored. When PC reaches the last opcode in IMEM (at 0xFFC), execution continues from the first opcode (PC wraps to 0x000).

All accesses to DMEM are performed using the lowest 12 bits of the address calculated by the load/store instruction (higher bits are ignored). Moreover, contrary to the standard MIPS architecture, the RSP can correctly perform misaligned memory accesses (eg: it is possible to fetch a 32-bit word at address 0x001, which will contain the 4 bytes at 0x001-0x004). The standard MIPS architecture allows misaligned accesses only through the LWL/LWR and SWL/SWR pairs, which are not required on the RSP.
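As a sketch, the addressing rules above can be modeled in C as follows (dmem_lw is an illustrative name, not a real API; DMEM is treated as a 4 KiB byte array):

```c
#include <stdint.h>

// Illustrative model of a 32-bit DMEM load: only the low 12 bits of the
// computed address are used, and the access may be misaligned. The RSP is
// big-endian, so the byte at the lowest address is the most significant.
static uint32_t dmem_lw(const uint8_t dmem[4096], uint32_t addr) {
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 8) | dmem[(addr + i) & 0xFFF];
    return r;
}
```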

Vector Unit (VU)
The VU is the internal unit of the RSP CPU core that is able to perform fixed-point SIMD calculations. It is a proprietary design which does not follow any standard specification. Its opcodes and registers are exposed to the core via the COP2 interface.

Vector registers and glossary
The VU contains 32 128-bit SIMD registers, each organized in 8 lanes of 16 bits each. Most VU opcodes perform the same operation in parallel on each of the 8 lanes. The arrangement is thus similar to x86 SSE2 registers in EPI16 format.

The vector register array is called VPR in this document, so VPR[4] refers to the fifth register (usually called $v4 in assembly). When referring to specific portions of a register, we use the following convention:


 * VPR[vt][4..7] refers to byte indices, that is, bytes from 4 to 7, counting from the higher part of the register (in big-endian order).
 * VPR[vt]<4..7> refers to specific lane indices, that is, lanes from 4 to 7, counting from the higher part of the register (in big-endian order).
 * Within each lane, (x..y) refers to inclusive bit ranges. Notice that bits are counted as usual in little-endian order (bit 0 is the lowest, bit 15 is the highest), and thus they are written as (15..0).

Ranges are specified using the inclusive x..y notation (that is, both x and y are part of the range).

The concatenation of disjoint ranges is written with a comma; for instance, VPR[vt][0..3,8..11] means 8 bytes formed by concatenating the 4 bytes starting at 0 with the 4 bytes starting at 8.

Vector lanes are usually interpreted as fixed-point numbers. As a homebrew programmer, it is useful to understand the meaning of each opcode and its correct usage while writing code, which goes beyond the exact hardware description of how bits are shuffled around. To refer to a fixed-point format, we use the syntax S1.15, where "S" means "signed" (while "U" means "unsigned"), "1" is the number of bits of the integral part, and "15" is the number of bits of the fractional part.
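For instance, an S1.15 value stored in a 16-bit lane can be converted to and from a float like this (illustrative helper names; the saturation in the float-to-fixed helper is a convention of this sketch, not an RSP feature):

```c
#include <stdint.h>

// S1.15: 1 sign/integral bit plus 15 fractional bits, stored in an int16_t.
// The representable range is [-1.0, +1.0).
static float s115_to_float(int16_t x) {
    return (float)x / 32768.0f;
}

static int16_t float_to_s115(float f) {
    float v = f * 32768.0f;
    if (v >  32767.0f) v =  32767.0f;  // saturate at the representable range
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v;
}
```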

Accumulator
The RSP contains an 8-lane SIMD accumulator, which is used implicitly by multiplication opcodes. Each of the 8 lanes is 48 bits wide, which allows accumulating intermediate results of calculations without the loss of precision that would occur when storing them into a 16-bit lane of a vector register.

It is possible to extract the contents of the accumulator through the VSAR opcode; one call to this opcode can extract a 16-bit portion of each lane and store it into the specified vector register. The three portions are conventionally called ACC_LO (bits 15..0 of each lane), ACC_MD (bits 31..16 of each lane), and ACC_HI (bits 47..32 of each lane).

If you exclude the VSAR instruction, which extracts the accumulator piecewise, it is better to think of the accumulator as a single register whose lanes are 48 bits wide.
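A minimal C model of one accumulator lane and the three 16-bit portions extracted by VSAR (a sketch; the portion names ACC_LO / ACC_MD / ACC_HI follow the convention described above):

```c
#include <stdint.h>

// One accumulator lane is 48 bits wide; here it is kept sign-extended in an
// int64_t. VSAR extracts one of three 16-bit portions of each lane.
static uint16_t acc_lo(int64_t acc) { return (uint16_t)( acc        & 0xFFFF); } // bits 15..0
static uint16_t acc_md(int64_t acc) { return (uint16_t)((acc >> 16) & 0xFFFF); } // bits 31..16
static uint16_t acc_hi(int64_t acc) { return (uint16_t)((acc >> 32) & 0xFFFF); } // bits 47..32
```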

Control registers
The VU contains three control registers: VCO and VCC (16-bit), and VCE (8-bit).

These registers are used as flag registers by several opcodes. Even though each flag has a general meaning, opcodes tend to also use them in mind-twisting ways to obtain the desired result. It does not make much sense to describe them fully at a general level; instead, each instruction will explain if and how it uses or modifies the control registers.

To read/write the contents of the control registers, the CFC2 / CTC2 instructions can be used.

Clamping
Multiplication opcodes perform a clamping step when extracting the accumulator into a vector register. Notice that each lane of the accumulator is always treated as a signed 48-bit number.

Signed clamping holds no surprises: the accumulator lane is saturated to the signed 16-bit range, so the returned value is always within [-32768, 32767].
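A C sketch of the signed clamping step (the 48-bit accumulator lane is assumed to be sign-extended into an int64_t):

```c
#include <stdint.h>

// Saturate a (sign-extended) 48-bit accumulator lane to the signed
// 16-bit range [-32768, 32767].
static int16_t clamp_signed(int64_t acc) {
    if (acc < -32768) return -32768;
    if (acc >  32767) return  32767;
    return (int16_t)acc;
}
```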

Unsigned clamping is slightly different: the saturating threshold is 15-bit, but the saturated value is 16-bit (values above 0x7FFF saturate to 0xFFFF, and negative values clamp to 0).
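A C sketch of the unsigned clamping step, under the same sign-extended int64_t assumption:

```c
#include <stdint.h>

// Unsigned clamping: the comparison treats the accumulator lane as signed;
// negative values clamp to 0, and values above the 15-bit threshold 0x7FFF
// saturate to the full 16-bit value 0xFFFF.
static uint16_t clamp_unsigned(int64_t acc) {
    if (acc < 0)      return 0;
    if (acc > 0x7FFF) return 0xFFFF;
    return (uint16_t)acc;
}
```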

Element field
Most VU instructions have a 3-register format with an additional modifier called the "element field". For instance (using GNU assembly syntax), in vaddc $v01, $v02, $v03.e7 the suffix .e7 is the element modifier. Normally (and especially in GNU syntax, which is more orthogonal and uniform), it refers to a specific lane of the third register, which is why it is commonly formatted without a leading whitespace. In this example, it "selects" lane 7 of register $v03. The exact meaning of the element modifier varies across instruction groups, and the way it is assembled also changes wildly. Pay attention to the description of each instruction group to check what the element modifier means and how it is encoded in the opcode.

Broadcast modifier
One of the most common uses of the element field is the broadcast modifier. This modifier is used by computational instructions and select instructions, and allows "broadcasting" (duplicating) one or more lanes over other lanes, just for the purpose of the current opcode. For instance, in vaddc $v01, $v02, $v03.e1, the suffix .e1 is the broadcast modifier. Normally, the instruction would add the two registers lane by lane; with the modifier, the second lane (index 1) of $v03 is added to all lanes of $v02.

The modifier is stored in the element (e) field of the opcode.
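The lane mapping performed by the broadcast modifier can be sketched as follows (the exact encoding of the e field is an assumption based on common RSP documentation; the per-instruction descriptions remain authoritative):

```c
// Returns which lane of the source register is read when producing output
// lane i, given the 4-bit broadcast field e. Assumed mapping:
//   e = 0..1  -> identity (no broadcast)
//   e = 2..3  -> quarters: each pair of lanes duplicated
//   e = 4..7  -> halves: each group of 4 lanes duplicated
//   e = 8..15 -> a single lane broadcast to all lanes
static int broadcast_lane(int e, int i) {
    if (e <= 1) return i;
    if (e <= 3) return (i & ~1) | (e & 1);
    if (e <= 7) return (i & ~3) | (e & 3);
    return e & 7;
}
```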

Loads and stores
The instructions in this group perform a load/store from DMEM into/from a vector register. Their operands are:


 * base is the index of a scalar register used as the base address for the memory access.
 * offset is a signed offset added to the value of the base register (with some scaling, depending on the actual instruction).
 * vt is the vector register.
 * element is used to index a specific byte/word within the vector register, usually specifying the first element affected by the operation (this allows accessing sub-portions of the vector register).

Single-lane instructions
Single-lane instructions are an instruction group that performs an operation on a single lane of a single input register, and stores the result into a single lane of a single output register.

Example syntax: vmov $v01.e3, $v02.e7. In this example, the value in lane 7 of $v02 is moved to lane 3 of $v01. The assembly uses the broadcast modifier syntax, but no actual broadcast is performed, as these instructions operate on the single specified lane. Only the single-lane broadcast modifiers (.e0 ... .e7) are supported.

In the opcode, the fields e and de are used to compute the source lane se and the destination lane, that is, to specify which lane of the input and of the output register (respectively) is affected.

de is 5 bits long (range 0..31); the highest bits are always ignored, and the destination lane is simply de & 7.

e is 4 bits long (range 0..15). When bit 3 of e is 1, e & 7 is used as the source lane se, as expected. When bit 3 of e is 0, a hardware bug is triggered, and portions of the lower bits of de are replaced with portions of the bits of e while computing se. Specifically, all bits in e from its topmost set bit and higher are replaced with the same-position bits in de. Notice that this behavior is actually consistent with what happens when bit 3 is 1, which means that there is no need to think of it as a special case.
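The rule can be written in C pseudo-code as follows (a sketch derived from the description above, not from a hardware dump):

```c
// Compute the source lane se from the 4-bit 'e' field and the 3-bit
// destination lane 'de': all bits of e from its topmost set bit and higher
// come from de, and the bits below it come from e. When bit 3 of e is set
// this degenerates to e & 7, and when e == 0 it yields de.
static int source_lane(int e, int de) {
    int k = -1;                                  // topmost set bit of e
    for (int i = 3; i >= 0; i--)
        if (e & (1 << i)) { k = i; break; }
    int keep = (k < 0) ? 0 : ((1 << k) - 1);     // low bits kept from e
    return ((de & ~keep) | (e & keep)) & 7;
}
```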

Computational instructions
Instructions in this group have a three-operand format: an output register, a first input register, and a second input register carrying an element field. The element field acts as a broadcast modifier (as found in other SIMD architectures), which modifies the access to the second input register, duplicating some lanes and hiding others.

This is the list of opcodes in this group.

Select instructions
Select instructions have the same general format: three registers, plus a broadcast modifier (as found in other SIMD architectures) on the second input register, which modifies the access to it, duplicating some lanes and hiding others. See the Computational instructions section for details.

This is the list of opcodes in this group:

VU/SU Moves
These are the standard MIPS opcodes (MFC2, MTC2, CFC2, CTC2) for moving data in and out of the coprocessor registers. Vector moves follow the same format as standard MIPS coprocessor moves, but use part of the lower 11 bits (which are normally unused) to specify the element field, selecting which lane of the VPR is accessed. Notice that e in this case is not a broadcast modifier: it specifies a byte offset (not a lane index!), so to copy a lane, an even byte offset (lane index times 2) must be specified.

For instance, copying the lower 16 bits of a GPR into the fifth lane (lane 4) of a VPR requires the opcode to be assembled with e=8, as explained above.

MTC2 moves the lower 16 bits of the general purpose register rt to the bytes VPR[vt][e..e+1]. If e is 15, only VPR[vt][15] is written (with the upper 8 bits of the value).

MFC2 moves the 2 bytes VPR[vt][e..e+1] to GPR rt, sign-extending the 16-bit value to 32 bits. If e is 15, the lower byte is taken from byte 0 of the register (that is, it wraps around).

CTC2 moves the lower 16 bits of GPR rt into the control register specified by rd, while CFC2 does the reverse, moving the control register specified by rd into GPR rt, sign-extending to 32 bits. Note that both CFC2 and CTC2 ignore the e field. For these instructions, the control register is specified as follows: 0 = VCO, 1 = VCC, 2 = VCE.

Assembly
Notice that it is possible to use the lane syntax for the element field to refer to a specific lane, but if the access is made using LLV or LDV (4 or 8 bytes), it will overflow into the following lanes.

Description
These instructions (LBV, LSV, LLV, LDV) load a scalar value (1, 2, 4, or 8 bytes) from DMEM into a VPR. Loads affect only a portion of the vector register (which is 128-bit); the other bytes in the register are not modified.

The address in DMEM where the value is fetched is computed as GPR[base] + (offset * access_size), where access_size is the number of bytes being accessed (eg: 4 for LLV). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is VPR[vt][e..e+access_size-1], that is, e selects the first accessed byte within the vector register. When e+access_size is bigger than 16, fewer bytes are processed (eg: LLV with e=13 only loads 3 bytes from memory, into VPR[vt][13..15]).
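Putting the address and element rules together, a scalar load can be modeled in C like this (a sketch with illustrative names; size is the access size in bytes, 1, 2, 4, or 8):

```c
#include <stdint.h>

// Model of a scalar vector load (LBV/LSV/LLV/LDV): the address uses only
// its low 12 bits and may be misaligned; bytes land at vpr[e..e+size-1],
// truncated at the end of the 16-byte register.
static void scalar_load(uint8_t vpr[16], const uint8_t dmem[4096],
                        uint32_t base, int32_t offset, int size, int e) {
    uint32_t addr = base + (uint32_t)(offset * size);
    for (int i = 0; i < size && e + i < 16; i++)
        vpr[e + i] = dmem[(addr + i) & 0xFFF];
}
```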

Usage
These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Input data should already be provided in vectorized format by the CPU, so that it is possible to use a vector load (LQV, in case the input is made of 16-bit data) or a packed load (LPV / LUV, in case the input is made of 8-bit data). Consider also using MTC2 to load a 16-bit value into a lane of a VPR when the value is available in a GPR.

A possible use-case for these instructions is reversing the order of the lanes. For instance, in audio codecs, windowing algorithms often combine sequences of audio samples with other sequences in reverse order. The RSP does not have an instruction to reverse the order of the lanes, so in that case it might be necessary to manually reverse the lanes while loading, using LSV.

Assembly
Notice that it is possible to use the lane syntax for the element field to refer to a specific lane, but if the access is made using SLV or SDV (4 or 8 bytes), it will overflow into the following lanes.

Description
These instructions (SBV, SSV, SLV, SDV) store a scalar value (1, 2, 4, or 8 bytes) from a VPR into DMEM.

The address in DMEM where the value will be stored is computed as GPR[base] + (offset * access_size), where access_size is the number of bytes being accessed (eg: 4 for SLV). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is VPR[vt][e..e+access_size-1], that is, e selects the first accessed byte within the vector register. When e+access_size is bigger than 16, the element access wraps within the vector and a full-size store is always performed (eg: SLV with e=14 stores VPR[vt][14..15,0..1] into memory, for a total of 4 bytes).
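The wrapping behavior can be modeled in C like this (a sketch with illustrative names):

```c
#include <stdint.h>

// Model of a scalar vector store (SBV/SSV/SLV/SDV): a full-size store is
// always performed, and the element index wraps around inside the
// 16-byte register.
static void scalar_store(uint8_t dmem[4096], const uint8_t vpr[16],
                         uint32_t base, int32_t offset, int size, int e) {
    uint32_t addr = base + (uint32_t)(offset * size);
    for (int i = 0; i < size; i++)
        dmem[(addr + i) & 0xFFF] = vpr[(e + i) & 15];
}
```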

Usage
These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Data flow between the RSP and the VR4300 should be structured in vectorized format, so that it is possible to use a vector store (SQV, in case the output is made of 16-bit data) or a packed store (SPV / SUV, in case the output is made of 8-bit data). Consider also using MFC2 to move a 16-bit value from a lane of a VPR into a GPR.

See scalar loads for an example of a use-case (reversing a vector) that can also be implemented via SSV.

Assembly
Notice that the element field is optional (it defaults to 0) and is usually not specified, because these instructions are meant to affect the whole vector. The element field can be specified using the lane syntax, or as a raw number which maps to the byte offset inside the vector.

Description
Roughly, these instructions behave like LWL and LWR: combined, they allow reading 128 bits of data into a vector register, irrespective of the alignment.

When the data to be loaded is 128-bit aligned within DMEM, LQV is sufficient to read the whole vector (LRV in this case is redundant, because it becomes a no-op).

The actual bytes accessed in DMEM depend on the instruction: for LQV, the bytes are those starting at the computed address, up to and excluding the next 128-bit aligned byte; for LRV, the bytes are those starting at the previous 128-bit aligned byte, up to and excluding the computed address. Again, this is exactly the same behavior as LWL and LWR, but for 128-bit loads.

The element field e is used as a byte offset within the vector register to specify the first byte affected by the operation; that is, the part of the vector being loaded by the instruction pair is VPR[vt][e..15]. Thus, a non-zero element means that fewer bytes are loaded.

Usage
LQV is the most standard way to fill a full VPR vector register with contents loaded from DMEM. Given that it is usually possible to define the layout of data in DMEM, it is advisable to design it so that vectors are always aligned to 128 bits (16 bytes), using the .align directive: this allows reading the vector using just LQV, in 1 cycle (though the load has a 3-cycle latency, like all instructions that write to a VPR). One example of using LQV and LRV as a pair is to perform a fast memcpy from a possibly misaligned source address to an aligned destination buffer.
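The byte coverage of the pair can be sketched in C (a model with illustrative names, assuming e == 0; a real implementation would be an RSP assembly loop):

```c
#include <stdint.h>

// Model of LQV: load bytes from 'addr' up to (excluding) the next 128-bit
// boundary into vpr, starting at byte 0.
static void lqv(uint8_t vpr[16], const uint8_t dmem[4096], uint32_t addr) {
    uint32_t end = (addr & ~15u) + 16;
    for (uint32_t a = addr, i = 0; a < end; a++, i++)
        vpr[i] = dmem[a & 0xFFF];
}

// Model of LRV: load bytes from the previous 128-bit boundary up to
// (excluding) 'addr' into the tail of vpr. A no-op at aligned addresses.
static void lrv(uint8_t vpr[16], const uint8_t dmem[4096], uint32_t addr) {
    uint32_t start = addr & ~15u;
    for (uint32_t a = start, i = 16 - (addr - start); a < addr; a++, i++)
        vpr[i] = dmem[a & 0xFFF];
}
```

Issuing lqv at a misaligned address and lrv at that address plus 16 fills all 16 bytes of the register, which is the idiom a misaligned-to-aligned copy loop would use.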