Reality Signal Processor/CPU Core


== Scalar unit (SU) ==

The scalar unit is the half of the RSP core that is similar to a standard 32-bit MIPS R4000 CPU. It has 32 32-bit registers (conventionally called r0-r31) and implements most standard opcodes. This page does not describe the whole scalar unit, as standard MIPS documentation suffices; it only highlights the main differences.

=== Missing opcodes ===

The following opcodes are not implemented by the RSP:

* Multiplications and divisions. The RSP does not have a multiplication or division unit, so there is no MULT, MULTU, DIV, DIVU, MFHI, MFLO, MTHI, MTLO.
* 64-bit instructions. The RSP only has 32-bit scalar registers in the SU, so there are no 64-bit opcodes (the ones starting with D, such as DADDIU, DSRL, etc.) nor 64-bit memory accesses such as LD, SD, LDL, SDL.
* No opcodes for misaligned memory accesses. All memory accesses to DMEM can be performed correctly at misaligned addresses using the standard opcodes such as LW / SW or LH / LHU / SH, so there is no LWL, LWR, SWL, SWR.
* No traps or exceptions. The RSP does not implement any form of interrupt or exception handling, so there is no SYSCALL nor trap instructions (TGE, TLT, etc.). BREAK is available, but it has a special behavior (see below).
* No support for likely branches. The "likely" variants of the branch instructions are not supported; the missing opcodes are the ones ending with L (such as BEQL, BLEZL, etc.).

=== Memory access ===

The RSP is a Harvard architecture: all opcodes are fetched from IMEM (4 KB) and all data is accessed in DMEM (4 KB).

The PC register is 12-bit. All higher address bits in branch / call instructions are thus ignored. When PC reaches the last opcode (at 0xFFC), execution continues to the first opcode in IMEM (PC wraps to 0x000).

All accesses to DMEM are performed using the lowest 12 bits of the address calculated by the load/store instruction (higher bits are ignored). Moreover, contrary to the standard MIPS architecture, the RSP can correctly perform misaligned memory accesses (e.g. it is possible to fetch a 32-bit word at address 0x001, which will contain the 4 bytes at addresses 0x1-0x4). The standard MIPS architecture allows misaligned accesses only through the LWL/LWR and SWL/SWR pairs, which are not required on the RSP.
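
To make the addressing rules concrete, here is a minimal C sketch of how a 32-bit DMEM load could be modelled (the names dmem and load_word_dmem are invented for this example, and wrap-around inside DMEM for the trailing bytes is an assumption):
<syntaxhighlight lang="c">
#include <stdint.h>

// DMEM is 4 KB; only the low 12 bits of the computed address are used.
static uint8_t dmem[4096];

// Model of a 32-bit load (e.g. LW): the address may be misaligned, and each
// byte address is masked to 12 bits (assumed to wrap inside DMEM).
uint32_t load_word_dmem(uint32_t base, int32_t offset)
{
    uint32_t addr = (uint32_t)(base + offset) & 0xFFF;   // higher bits ignored
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value = (value << 8) | dmem[(addr + i) & 0xFFF]; // big-endian bytes
    return value;
}
</syntaxhighlight>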

== Vector Unit (VU) ==

The VU is the internal unit of the RSP CPU core that is able to perform fixed-point SIMD calculations. It is a proprietary design which does not follow any standard specification. Its opcodes and registers are exposed to the core via the COP2 interface.

=== Vector registers and glossary ===

The VU contains 32 128-bit SIMD registers, each organized as 8 lanes of 16 bits each. Most VU opcodes perform the same operation in parallel on each of the 8 lanes. The arrangement is thus similar to x86 SSE2 registers used in EPI16 format.

The vector register array is called VPR in this document, so VPR[4] refers to the fifth register (usually called v4 in assembly). When referring to specific portions of a register, we use the following conventions:

* VPR[vt][4..7] refers to byte indices, that is, bytes from 4 to 7, counting from the higher part of the register (in big-endian order).
* VPR[vt]<4..7> refers to specific lane indices, that is, lanes from 4 to 7, counting from the higher part of the register (in big-endian order).
* Within each lane, VPR[vt]<2>(3..0) refers to an inclusive bit range. Notice that bits are counted as usual in little-endian order (bit 0 is the lowest, bit 15 is the highest), and thus ranges are written as (high..low).

Ranges are specified using the beg..end inclusive notation (that is, both beg and end are part of the range).

The concatenation of disjoint ranges is written with a comma; for instance, [0..3,8..11] means the 8 bytes formed by concatenating the 4 bytes starting at 0 with the 4 bytes starting at 8.

Vector lanes are usually interpreted as fixed-point numbers. As a homebrew programmer, it is useful to understand the meaning of each opcode and its correct usage while writing code, which goes beyond the exact hardware description of how bits are shuffled around. To refer to a fixed-point format, we use the syntax S1.15, where "S" means "signed" (while "U" is "unsigned"), "1" is the number of bits of the integral part, and "15" is the number of bits of the fractional part.
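
As a rough illustration of the lane layout and of the S1.15 notation, here is a small C sketch (the type and helper names are invented for this article; the conversion assumes the usual reading of S1.15 as a signed 16-bit value divided by 2^15):
<syntaxhighlight lang="c">
#include <stdint.h>

// One 128-bit vector register: 8 lanes of 16 bits each, with lane 0 being
// the highest (big-endian) part of the register.
typedef struct {
    uint16_t lane[8];
} vpr_t;

// Interpret a 16-bit lane as an S1.15 fixed-point value, i.e. the signed
// raw value divided by 2^15, giving a number in [-1.0, +1.0).
static double s1_15_to_double(uint16_t raw)
{
    return (double)(int16_t)raw / 32768.0;
}

// Convert back (truncating), just for illustration.
static uint16_t double_to_s1_15(double x)
{
    return (uint16_t)(int16_t)(x * 32768.0);
}
</syntaxhighlight>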

=== Accumulator ===

The RSP contains an 8-lane SIMD accumulator, which is used implicitly by multiplication opcodes. Each of the 8 lanes is 48 bits wide, which allows intermediate results to be accumulated without the loss of precision that would occur when storing them into a 16-bit lane of a vector register.

It is possible to extract the contents of the accumulator through the VSAR opcode; one call to this opcode can extract a 16-bit portion of each lane and store it into the specified vector register. The three portions are conventionally called ACCUM_LO (bits 15..0 of each lane), ACCUM_MD (bits 31..16 of each lane), and ACCUM_HI (bits 47..32 of each lane).

Aside from the VSAR instruction, which extracts the accumulator piecewise, it is better to think of the accumulator as a single register whose lanes are 48 bits wide.
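
The following C sketch models the accumulator layout and the VSAR-style extraction just described (type and function names are invented for this article):
<syntaxhighlight lang="c">
#include <stdint.h>

// 8 accumulator lanes, each 48 bits wide, kept in the low 48 bits of a
// 64-bit integer.
typedef struct {
    int64_t lane[8];
} accum_t;

// VSAR-style extraction: copy one 16-bit portion of every lane
// (ACCUM_LO = bits 15..0, ACCUM_MD = bits 31..16, ACCUM_HI = bits 47..32)
// into an 8-lane output.
enum accum_portion { ACCUM_LO = 0, ACCUM_MD = 16, ACCUM_HI = 32 };

static void extract_accum(const accum_t *acc, enum accum_portion portion,
                          uint16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(acc->lane[i] >> portion);
}
</syntaxhighlight>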

=== Clamping ===

Multiplication opcodes perform a clamping step when extracting the accumulator into a vector register. Notice that each lane of the accumulator is always treated as a signed 48-bit number.

This is the pseudo-code for signed clamping (no surprises):

<pre>
function clamp_signed(accum)
    if accum < -32768  => return -32768
    if accum > 32767   => return 32767
    return accum
</pre>

The returned value is thus always within the signed 16-bit range.

This is the pseudo-code for unsigned clamping:

<pre>
function clamp_unsigned(accum)
    if accum < 0       => return 0
    if accum > 32767   => return 65535
    return accum
</pre>

Notice that in unsigned clamping, the saturation threshold is a 15-bit value (32767), but the saturated result is the full 16-bit value (65535).
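
For reference, the same clamping rules can be written as C functions (a direct transliteration of the pseudo-code above, with the 48-bit accumulator lane passed as a 64-bit signed integer):
<syntaxhighlight lang="c">
#include <stdint.h>

// Signed clamping: saturate the accumulator lane to the signed 16-bit range.
static int16_t clamp_signed(int64_t accum)
{
    if (accum < -32768) return -32768;
    if (accum >  32767) return  32767;
    return (int16_t)accum;
}

// Unsigned clamping: the threshold is 15-bit (32767), but the saturated
// value is the full 16-bit 65535.
static uint16_t clamp_unsigned(int64_t accum)
{
    if (accum < 0)     return 0;
    if (accum > 32767) return 65535;
    return (uint16_t)accum;
}
</syntaxhighlight>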

=== Broadcast modifier ===

Some families of instructions in the VU (specifically, the computational instructions and the select instructions) allow a "broadcast modifier" to be applied to one of the input registers. For instance, in this instruction:
<syntaxhighlight lang="asm">
vaddc $v01, $v04, e(1)
</syntaxhighlight>
<code>e(1)</code> is the broadcast modifier. Normally, the instruction would add the two registers lane by lane; with the modifier, the second lane (index 1) of <code>$v04</code> is added to all lanes of <code>$v01</code>.

The modifier is stored in the <code>element</code> field of the opcode:

{| class="wikitable"
!<code>element</code>
!GNU syntax
!SGI syntax
!Lanes being accessed
!Description
|-
|0
|
|
|0,1,2,3,4,5,6,7
|Normal register access (no broadcast)
|-
|1
|
|
|0,1,2,3,4,5,6,7
|Normal register access (no broadcast)
|-
|2
|e(0q)
|[0q]
|0,0,2,2,4,4,6,6
|Broadcast 4 of 8 lanes
|-
|3
|e(1q)
|[1q]
|1,1,3,3,5,5,7,7
|Broadcast 4 of 8 lanes
|-
|4
|e(0h)
|[0h]
|0,0,0,0,4,4,4,4
|Broadcast 2 of 8 lanes
|-
|5
|e(1h)
|[1h]
|1,1,1,1,5,5,5,5
|Broadcast 2 of 8 lanes
|-
|6
|e(2h)
|[2h]
|2,2,2,2,6,6,6,6
|Broadcast 2 of 8 lanes
|-
|7
|e(3h)
|[3h]
|3,3,3,3,7,7,7,7
|Broadcast 2 of 8 lanes
|-
|8
|e(0)
|[0]
|0,0,0,0,0,0,0,0
|Broadcast single lane
|-
|9
|e(1)
|[1]
|1,1,1,1,1,1,1,1
|Broadcast single lane
|-
|10
|e(2)
|[2]
|2,2,2,2,2,2,2,2
|Broadcast single lane
|-
|11
|e(3)
|[3]
|3,3,3,3,3,3,3,3
|Broadcast single lane
|-
|12
|e(4)
|[4]
|4,4,4,4,4,4,4,4
|Broadcast single lane
|-
|13
|e(5)
|[5]
|5,5,5,5,5,5,5,5
|Broadcast single lane
|-
|14
|e(6)
|[6]
|6,6,6,6,6,6,6,6
|Broadcast single lane
|-
|15
|e(7)
|[7]
|7,7,7,7,7,7,7,7
|Broadcast single lane
|}
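
The lane mapping in the table can also be expressed in code. This is a small C sketch (function names invented for this article) that computes which lane of vt is actually read when lane i of the operation is processed, plus a generic element-wise operation using it (carry and clamping details of the real opcodes are omitted):
<syntaxhighlight lang="c">
#include <stdint.h>

// Given the 4-bit element field and the lane index i (0..7) being processed,
// return the lane of vt that is actually accessed.
static int broadcast_source_lane(int element, int i)
{
    if (element < 2)                 // 0, 1: normal access
        return i;
    if (element < 4)                 // 2, 3: e(0q), e(1q)
        return (i & ~1) | (element - 2);
    if (element < 8)                 // 4..7: e(0h)..e(3h)
        return (i & ~3) | (element - 4);
    return element - 8;              // 8..15: e(0)..e(7), single lane
}

// Generic element-wise addition with the broadcast modifier applied to the
// second operand, as done by computational and select opcodes.
static void add_with_broadcast(uint16_t vd[8], const uint16_t vs[8],
                               const uint16_t vt[8], int element)
{
    for (int i = 0; i < 8; i++)
        vd[i] = (uint16_t)(vs[i] + vt[broadcast_source_lane(element, i)]);
}
</syntaxhighlight>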

=== Opcodes ===

==== Loads and stores ====

{| class="wikitable"
!31..26
!25..21
!20..16
!15..11
!10..7
!6..0
|-
|LWC2 or SWC2
|base
|vt
|opcode
|element
|offset
|}

These instructions perform a load from DMEM into a vector register, or a store from a vector register into DMEM.

* <code>base</code> is the index of the scalar register used as the base for the memory access.
* <code>offset</code> is a signed offset added to the value of the base register (with some scaling, depending on the actual instruction).
* <code>vt</code> is the vector register.
* <code>element</code> is used to index a specific byte/word within the vector register, usually specifying the first element affected by the operation (thus allowing access to sub-portions of the vector register); see the decoding sketch below.
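
As a sketch of how these fields are laid out, this is how they could be pulled out of a 32-bit instruction word (the struct and function are written for this article and are not part of any official API):
<syntaxhighlight lang="c">
#include <stdint.h>

// Field layout of LWC2/SWC2 vector loads/stores: 31..26 LWC2 or SWC2,
// 25..21 base, 20..16 vt, 15..11 opcode, 10..7 element, 6..0 signed offset.
typedef struct {
    unsigned base;      // scalar register used as base address
    unsigned vt;        // vector register
    unsigned opcode;    // which load/store (LBV, LSV, LQV, ...)
    unsigned element;   // first byte/lane of vt affected
    int      offset;    // signed 7-bit offset (scaled per instruction)
} vec_loadstore_t;

static vec_loadstore_t decode_vec_loadstore(uint32_t insn)
{
    vec_loadstore_t d;
    d.base    = (insn >> 21) & 0x1F;
    d.vt      = (insn >> 16) & 0x1F;
    d.opcode  = (insn >> 11) & 0x1F;
    d.element = (insn >>  7) & 0xF;
    d.offset  = (int)((insn & 0x7F) ^ 0x40) - 0x40;  // sign-extend 7 bits
    return d;
}
</syntaxhighlight>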
{| class="wikitable"
|+ List of all load and store opcodes
!Group
!Opcode
!Description
|-
|rowspan="4"|Scalar
|LBV / SBV
|Load / store 1 byte into/from a VPR
|-
|LSV / SSV
|Load / store 2 bytes into/from a VPR
|-
|LLV / SLV
|Load / store 4 bytes into/from a VPR
|-
|LDV / SDV
|Load / store 8 bytes into/from a VPR
|-
|rowspan="4"|128-bit
|LQV
|Load (up to) 16 bytes into a VPR, left-aligned
|-
|LRV
|Load (up to) 16 bytes into a VPR, right-aligned
|-
|SQV
|Store (up to) 16 bytes from a VPR, left-aligned
|-
|SRV
|Store (up to) 16 bytes from a VPR, right-aligned
|-
|rowspan="3"|Transpose
|LTV
|Load 8 lanes, one into each of 8 different VPRs (transposed)
|-
|STV
|Store 8 lanes, one from each of 8 different VPRs (transposed)
|-
|SWV
|
|-
|rowspan="4"|8-bit packed
|LPV / SPV
|Load / store 8 8-bit signed values into/from a VPR
|-
|LUV / SUV
|Load / store 8 8-bit unsigned values into/from a VPR
|-
|LHV / SHV
|Load / store 8 8-bit unsigned values into/from a VPR, accessing every other byte in memory
|-
|LFV / SFV
|Load / store 4 8-bit unsigned values into/from a VPR, accessing every fourth byte in memory
|}

==== Single-lane instructions ====

{| class="wikitable"
!31..26
!25
!24..21
!20..16
!15..11
!10..6
!5..0
|-
|COP2
|1
|vt_elem
|vt
|vd_elem
|vd
|opcode
|}

Single-lane instructions are an instruction group that performs an operation on a single lane of a single input register (VT<se>) and stores the result into a single lane of a single output register (VD<de>).

vt_elem and vd_elem are used to compute se and de, that is, to specify which lane of the input register and of the output register, respectively, is affected.

vd_elem is 4 bits long (range 0..15); the highest bit is always ignored so the destination lane de is computed from the lowest 3 bits.

vt_elem is 5 bits long (range 0..31). vt_elem(4) must be zero. When vt_elem(3) is 1, vt_elem(2..0) is used as the source lane se, as expected. When vt_elem(3) is 0, a hardware bug is triggered, and portions of the lower bits of vt_elem are replaced with portions of the bits of vd_elem while computing se. Specifically, all bits of vt_elem from the topmost set bit and higher are replaced with the same-position bits of vd_elem. Notice that this behaviour is actually consistent with what happens when vt_elem(3) is 1, so there is no need to treat that as a special case. Pseudo-code:

<pre>
de(2..0) = vd_elem(2..0)
msb = highest_set_bit(vt_elem)
se(2..0) = vd_elem(2..msb) || vt_elem(msb-1..0)
</pre>

TODO: complete analysis for vt_elem(4) == 1.
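
The following C sketch is one interpretation of the pseudo-code above (it has not been verified against hardware; in particular, the vt_elem == 0 case is not described by the text and is guessed here, and the vt_elem(4) == 1 case is left out, matching the TODO above):
<syntaxhighlight lang="c">
// Compute the destination lane (de) and source lane (se) of a single-lane
// instruction from the raw vd_elem and vt_elem fields.
static void compute_lanes(unsigned vd_elem, unsigned vt_elem,
                          unsigned *de, unsigned *se)
{
    *de = vd_elem & 7;

    // Find the topmost set bit of vt_elem (only the low 4 bits matter here).
    int msb = -1;
    for (int b = 3; b >= 0; b--) {
        if (vt_elem & (1u << b)) { msb = b; break; }
    }

    if (msb < 0) {
        // vt_elem == 0: assumed to take every bit from vd_elem, consistent
        // with the replacement rule described in the text.
        *se = vd_elem & 7;
    } else {
        // Bits below the topmost set bit come from vt_elem; the bits from
        // the topmost set bit upwards come from vd_elem.
        unsigned mask = (1u << msb) - 1;
        *se = ((vd_elem & ~mask) | (vt_elem & mask)) & 7;
    }
}
</syntaxhighlight>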

{| class="wikitable"
|+ Single-lane instructions
!Opcode
!Description
|-
|VMOV
|Copy one lane of a VPR into another VPR
|-
|VRCP
|Compute the 32-bit reciprocal of a 16-bit fixed point
|-
|VRSQ
|Compute the 32-bit reciprocal square root of a 16-bit fixed point
|-
|VRCPH
|Extract the higher 16 bits of the result of a previous VRCP
|-
|VRSQH
|Extract the higher 16 bits of the result of a previous VRSQ
|-
|VRCPL
|Compute the 32-bit reciprocal of a 32-bit fixed point
|-
|VRSQL
|Compute the 32-bit reciprocal square root of a 32-bit fixed point
|-
|VNOP
|No operation
|}

==== Computational instructions ====

{| class="wikitable"
!31..26
!25
!24..21
!20..16
!15..11
!10..6
!5..0
|-
|COP2
|1
|element
|vt
|vs
|vd
|opcode
|}

Instructions have this general format:

<code>VINSN vd, vs, vt, e(…)</code>

where <code>e(…)</code> is the broadcast modifier (as found in other SIMD architectures), which modifies the access to <code>vt</code>, duplicating some lanes and hiding others.

This is the list of opcodes in this group.

{| class="wikitable"
!Opcode
!Instruction
!Description
|-
|0x00
|VMULF
|Vector multiply S1.15 * S1.15, with rounding and signed clamping
|-
|0x01
|VMULU
|Vector multiply S1.15 * S1.15, with rounding and unsigned clamping
|-
|0x04
|VMUDL
|Vector multiply U0.16 * U0.16, with signed clamping
|-
|0x05
|VMUDM
|Vector multiply S0.16 * U0.16, with signed clamping
|-
|0x06
|VMUDN
|Vector multiply U0.16 * S0.16, with signed clamping
|-
|0x07
|VMUDH
|Vector multiply S0.16 * S0.16, with signed clamping
|-
|0x08
|VMACF
|Like VMULF, but also adds the result to the accumulator
|-
|0x09
|VMACU
|Like VMULU, but also adds the result to the accumulator
|-
|0x0C
|VMADL
|Like VMUDL, but also adds the result to the accumulator
|-
|0x0D
|VMADM
|Like VMUDM, but also adds the result to the accumulator
|-
|0x0E
|VMADN
|Like VMUDN, but also adds the result to the accumulator
|-
|0x0F
|VMADH
|Like VMUDH, but also adds the result to the accumulator
|-
|0x10
|VADD
|Vector add with carry
|-
|0x14
|VADDC
|Vector add writing overflow into carry
|-
|0x1D
|VSAR
|Read a portion of the accumulator into a VPR
|-
|0x28
|VAND
|Vector bitwise and (a & b)
|-
|0x29
|VNAND
|Vector bitwise nand (~(a & b))
|-
|0x2A
|VOR
|Vector bitwise or (a <nowiki>|</nowiki> b)
|-
|0x2B
|VNOR
|Vector bitwise nor (~(a <nowiki>|</nowiki> b))
|-
|0x2C
|VXOR
|Vector bitwise xor (a ^ b)
|-
|0x2D
|VNXOR
|Vector bitwise nxor (~(a ^ b))
|}
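
As an example of how these opcodes tie together the broadcast modifier, the accumulator, and the clamping step described earlier, here is a hedged C sketch of a single VMULF lane. It follows the commonly documented behaviour (product doubled, rounding bias added, middle bits of the accumulator extracted with signed clamping); treat it as an illustration to be checked against the detailed opcode documentation, not as a definitive reference:
<syntaxhighlight lang="c">
#include <stdint.h>

// One lane of VMULF (sketch): S1.15 * S1.15 with rounding and signed
// clamping. The 48-bit accumulator lane is modelled as an int64_t.
static int16_t vmulf_lane(int16_t vs, int16_t vt, int64_t *accum)
{
    // Doubling keeps the S1.15 * S1.15 product aligned so that the result
    // sits in bits 31..16 of the accumulator; 0x8000 is the rounding bias.
    *accum = (int64_t)vs * vt * 2 + 0x8000;

    // Extract the result from bits 31..16 and apply signed clamping.
    int64_t result = *accum >> 16;
    if (result < -32768) return -32768;
    if (result >  32767) return  32767;
    return (int16_t)result;
}
</syntaxhighlight>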

==== Select instructions ====

{| class="wikitable"
!31..26
!25
!24..21
!20..16
!15..11
!10..6
!5..0
|-
|COP2
|1
|element
|vt
|vs
|vd
|opcode
|}

Instructions have this general format:

<code>VINSN vd, vs, vt, e(…)</code>

where <code>e(…)</code> is the broadcast modifier (as found in other SIMD architectures), which modifies the access to <code>vt</code>, duplicating some lanes and hiding others. See the Computational instructions section for details.

This is the list of opcodes in this group:

{| class="wikitable"
!Opcode
!Instruction
|-
|0x20
|VLT
|-
|0x21
|VEQ
|-
|0x22
|VNE
|-
|0x23
|VGE
|-
|0x24
|VCL
|-
|0x25
|VCH
|-
|0x26
|VCR
|-
|0x27
|VMRG
|}