Reality Signal Processor/CPU Core: Difference between revisions

From N64brew Wiki
If you exclude the VSAR instruction, which cuts the accumulator piecewise to extract it, it is better to think of the accumulator as a single register where each lane is 48 bits wide.

=== Control registers ===

The VU contains 3 16-bit control registers: VCC, VCO, VCE.

These registers are used as flag registers by several opcodes. As with most flags, even though they have a general meaning, they tend to also be used in mind-twisting ways to obtain the desired result. It does not make much sense to describe them at the general level; instead, each instruction will explain if and how it uses or modifies the control registers.

To read and write the contents of the control registers, the <code>cfc2</code> and <code>ctc2</code> instructions can be used, respectively.
   
 
=== Clamping ===
 
vaddc $v01, $v04, e(1)
</syntaxhighlight><code>e(1)</code> is the broadcast modifier. Normally, the instruction would add the two registers lane by lane; with the modifier, the second lane (index 1) of <code>$v04</code> is added to all lanes of <code>$v01</code>.

The modifier is stored in the <code>element</code> field of the opcode.

|-
|2
|<code>e(0q)</code>
|<code>[0q]</code>
|0,0,2,2,4,4,6,6
|Broadcast 4 of 8 lanes
|-
|3
|<code>e(1q)</code>
|<code>[1q]</code>
|1,1,3,3,5,5,7,7
|Broadcast 4 of 8 lanes
|-
|4
|<code>e(0h)</code>
|<code>[0h]</code>
|0,0,0,0,4,4,4,4
|Broadcast 2 of 8 lanes
|-
|5
|<code>e(1h)</code>
|<code>[1h]</code>
|1,1,1,1,5,5,5,5
|Broadcast 2 of 8 lanes
|-
|6
|<code>e(2h)</code>
|<code>[2h]</code>
|2,2,2,2,6,6,6,6
|Broadcast 2 of 8 lanes
|-
|7
|<code>e(3h)</code>
|<code>[3h]</code>
|3,3,3,3,7,7,7,7
|Broadcast 2 of 8 lanes
|-
|8
|<code>e(0)</code>
|<code>[0]</code>
|0,0,0,0,0,0,0,0
|Broadcast single lane
|-
|9
|<code>e(1)</code>
|<code>[1]</code>
|1,1,1,1,1,1,1,1
|Broadcast single lane
|-
|10
|<code>e(2)</code>
|<code>[2]</code>
|2,2,2,2,2,2,2,2
|Broadcast single lane
|-
|11
|<code>e(3)</code>
|<code>[3]</code>
|3,3,3,3,3,3,3,3
|Broadcast single lane
|-
|12
|<code>e(4)</code>
|<code>[4]</code>
|4,4,4,4,4,4,4,4
|Broadcast single lane
|-
|13
|<code>e(5)</code>
|<code>[5]</code>
|5,5,5,5,5,5,5,5
|Broadcast single lane
|-
|14
|<code>e(6)</code>
|<code>[6]</code>
|6,6,6,6,6,6,6,6
|Broadcast single lane
|-
|15
|<code>e(7)</code>
|<code>[7]</code>
|7,7,7,7,7,7,7,7
|Broadcast single lane
|}
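The table above is regular enough to be condensed into a small decoder. The following C sketch is our own condensation of the table rows (the helper name and the bit formulas are not official RSP code): it computes which source lane of <code>vt</code> a given output lane reads, for each value of the <code>element</code> field.

```c
#include <assert.h>

/* Source-lane selection implied by the broadcast-modifier table above.
   Sketch: helper name and formulas are our own condensation of the table. */
static int broadcast_src_lane(int element, int lane) {
    if (element <= 1) return lane;                        /* 0-1: no broadcast */
    if (element <= 3) return (lane & ~1) | (element & 1); /* e(0q) / e(1q) */
    if (element <= 7) return (lane & 4) | (element & 3);  /* e(0h) .. e(3h) */
    return element & 7;                                   /* e(0) .. e(7) */
}
```

For example, with <code>e(1)</code> (element value 9), every output lane reads lane 1, matching the <code>vaddc</code> example earlier.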
   
=== Instructions overview ===

==== Loads and stores ====
   
 
* <code>base</code> is the index of a scalar register used as the base for the memory access.
* <code>offset</code> is a signed offset added to the value of the base register (with some scaling, depending on the actual instruction).
* <code>vt</code> is the vector register.
* <code>element</code> is used to index a specific byte/word within the vector register, usually specifying the first element affected by the operation (thus allowing access to sub-portions of the vector register).
 
{| class="wikitable"
!Group
!Opcode
!Instruction
!Description
|-
| rowspan="4" |Scalar
|0x00
|<code>lbv</code> / <code>sbv</code>
|Load / Store 1 byte into/from a VPR
|-
|0x01
|<code>lsv</code> / <code>ssv</code>
|Load / Store 2 bytes into/from a VPR
|-
|0x02
|<code>llv</code> / <code>slv</code>
|Load / Store 4 bytes into/from a VPR
|-
|0x03
|<code>ldv</code> / <code>sdv</code>
|Load / Store 8 bytes into/from a VPR
|-
| rowspan="4" |128-bit
|0x04
|<code>lqv</code>
|Load (up to) 16 bytes into a VPR, left-aligned
|-
|0x05
|<code>lrv</code>
|Load (up to) 16 bytes into a VPR, right-aligned
|-
|0x04
|<code>sqv</code>
|Store (up to) 16 bytes from a VPR, left-aligned
|-
|0x05
|<code>srv</code>
|Store (up to) 16 bytes from a VPR, right-aligned
|-
| rowspan="4" |8-bit packed
|0x06
|<code>lpv</code> / <code>spv</code>
|Load / store 8 8-bit signed values into/from a VPR
|-
|0x07
|<code>luv</code> / <code>suv</code>
|Load / store 8 8-bit unsigned values into/from a VPR
|-
|0x08
|<code>lhv</code> / <code>shv</code>
|Load / store 8 8-bit unsigned values into/from a VPR, accessing every other byte in memory
|-
|0x09
|<code>lfv</code> / <code>sfv</code>
|Load / store 4 8-bit unsigned values into/from a VPR, accessing every fourth byte in memory
|-
| rowspan="3" |Transpose
|0x0A
|<code>swv</code>
|
|-
|0x0B
|<code>ltv</code>
|Load 8 lanes into 8 different VPRs (transposed)
|-
|0x0B
|<code>stv</code>
|Store 8 lanes from 8 different VPRs (transposed)
|}
   
Single-lane instructions are a group of instructions that perform operations on a single lane of a single input register (<code>VT<se></code>), and store the result into a single lane of a single output register (<code>VD<de></code>).

Example syntax:<syntaxhighlight lang="asm">
vmov $v01, e(4), $v05, e(6)
</syntaxhighlight>In this example, the value in lane <code>$v05<6></code> is moved to lane <code>$v01<4></code>. In the assembly syntax, the [[Reality Signal Processor/CPU Core#Broadcast modifiers|broadcast modifier syntax]] is used, but no actual broadcast is performed, as these instructions operate on the single specified lane. Only the single-lane broadcast modifiers (<code>e(0)</code> ... <code>e(7)</code>) are supported.

In the opcode, the fields <code>vt_elem</code> and <code>vd_elem</code> are used to compute <code>se</code> and <code>de</code>, that is, to specify which lane of the input and output register, respectively, is affected.

<code>vd_elem</code> is 5 bits long (range 0..31); the highest bits are always ignored, and the destination lane <code>de</code> is simply <code>vd_elem(2..0)</code>.

<code>vt_elem</code> is 4 bits long (range 0..15). When <code>vt_elem(3)</code> is 1, <code>vt_elem(2..0)</code> is used as the source lane <code>se</code>, as expected. When <code>vt_elem(3)</code> is 0, a hardware bug is triggered and portions of the lower bits of <code>vt_elem</code> are replaced with portions of the bits of <code>vd_elem</code> while computing <code>se</code>. Specifically, all bits in <code>vt_elem</code> from the topmost set bit and higher are replaced with the same-position bits in <code>vd_elem</code>. Notice that this behaviour is consistent with what happens when <code>vt_elem(3)</code> is 1, so there is no need to treat that as a special case. Pseudo-code:

<code>de(2..0) = vd_elem(2..0)
msb = highest_set_bit(vt_elem)
se(2..0) = vd_elem(2..msb) || vt_elem(msb-1..0)</code>
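The pseudo-code above can be turned into a runnable model. This C sketch (the function name is ours, not official RSP code; the bit manipulation follows the description above) computes <code>se</code> from <code>vt_elem</code> and <code>vd_elem</code>, including the hardware-bug case:

```c
#include <stdint.h>

/* Model of the source-lane computation described above, including the
   vt_elem(3)==0 hardware bug. Sketch: the function name is ours. */
static unsigned se_from_elems(unsigned vt_elem, unsigned vd_elem) {
    int msb = -1;                              /* topmost set bit of vt_elem */
    for (int i = 3; i >= 0; i--) {
        if (vt_elem & (1u << i)) { msb = i; break; }
    }
    /* Bits from position msb upwards come from vd_elem, the rest from vt_elem */
    unsigned low_mask = (msb >= 0) ? ((1u << msb) - 1) : 0;
    return ((vd_elem & 7) & ~low_mask) | (vt_elem & low_mask);
}
```

When <code>vt_elem(3)</code> is 1, <code>low_mask</code> covers all three low bits and the function degenerates to <code>vt_elem(2..0)</code>, as expected.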
 
 
{| class="wikitable"
|+Single-lane instructions
!Opcode
!Instruction
!Description
|-
|0x33
|<code>vmov</code>
|Copy one lane of a VPR into another VPR
|-
|0x30
|<code>vrcp</code>
|Compute the 32-bit reciprocal of a 16-bit fixed point
|-
|0x34
|<code>vrsq</code>
|Compute the 32-bit reciprocal square root of a 16-bit fixed point
|-
|0x32
|<code>vrcph</code>
|Extract the higher 16 bits of the result of a previous VRCP
|-
|0x36
|<code>vrsqh</code>
|Extract the higher 16 bits of the result of a previous VRSQ
|-
|0x31
|<code>vrcpl</code>
|Compute the 32-bit reciprocal of a 32-bit fixed point
|-
|0x35
|<code>vrsql</code>
|Compute the 32-bit reciprocal square root of a 32-bit fixed point
|-
|0x37
|<code>vnop</code>
|No operation (?)
|-
|0x3F
|<code>vnull</code>
|No operation (?)
|}

Instructions have this general format:

<code>VINSN vd, vs, vt, e(…)</code>

where <code>e(…)</code> is the [[Reality Signal Processor/CPU Core#Broadcast modifier|broadcast modifier]] (as found in other SIMD architectures), which modifies the access to <code>vt</code>, duplicating some lanes and hiding others.

This is the list of opcodes in this group.
 
{| class="wikitable"
|-
|0x00
|<code>vmulf</code>
|Vector multiply S1.15 * S1.15, with rounding and signed clamping
|-
|0x01
|<code>vmulu</code>
|Vector multiply S1.15 * S1.15, with rounding and unsigned clamping
|-
|0x04
|<code>vmudl</code>
|Vector multiply U0.16 * U0.16 with signed clamping
|-
|0x05
|<code>vmudm</code>
|Vector multiply S0.16 * U0.16 with signed clamping
|-
|0x06
|<code>vmudn</code>
|Vector multiply U0.16 * S0.16 with signed clamping
|-
|0x07
|<code>vmudh</code>
|Vector multiply S0.16 * S0.16 with signed clamping
|-
|0x08
|<code>vmacf</code>
|Like VMULF, but also add the result to the accumulator
|-
|0x09
|<code>vmacu</code>
|Like VMULU, but also add the result to the accumulator
|-
|0x0C
|<code>vmadl</code>
|Like VMUDL, but also add the result to the accumulator
|-
|0x0D
|<code>vmadm</code>
|Like VMUDM, but also add the result to the accumulator
|-
|0x0E
|<code>vmadn</code>
|Like VMUDN, but also add the result to the accumulator
|-
|0x0F
|<code>vmadh</code>
|Like VMUDH, but also add the result to the accumulator
|-
|0x10
|<code>vadd</code>
|Vector add with carry
|-
|0x13
|<code>vabs</code>
|Vector absolute value
|-
|0x14
|<code>vaddc</code>
|Vector add writing overflow into carry
|-
|0x1D
|<code>vsar</code>
|Read a portion of the accumulator into a VPR
|-
|0x28
|<code>vand</code>
|Vector bitwise and (<code>a & b</code>)
|-
|0x29
|<code>vnand</code>
|Vector bitwise nand (<code>~(a & b)</code>)
|-
|0x2A
|<code>vor</code>
|Vector bitwise or (<code><nowiki>a | b</nowiki></code>)
|-
|0x2B
|<code>vnor</code>
|Vector bitwise nor (<code><nowiki>~(a | b)</nowiki></code>)
|-
|0x2C
|<code>vxor</code>
|Vector bitwise xor (<code>a ^ b</code>)
|-
|0x2D
|<code>vnxor</code>
|Vector bitwise nxor (<code>~(a ^ b)</code>)
|}
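As an illustration of the multiply group, here is a hedged C model of a single lane of <code>vmulf</code> (S1.15 * S1.15 with rounding and signed clamping). The helper name is ours, and the accumulator handling is simplified to one lane:

```c
#include <stdint.h>

/* One lane of vmulf: the S1.15 product is doubled, a rounding bias is added,
   then bits 31..16 of the accumulator are extracted with signed clamping.
   Simplified single-lane sketch; names are ours, not official RSP code. */
static int16_t vmulf_lane(int16_t a, int16_t b) {
    int64_t acc = (int64_t)a * b * 2 + 0x8000; /* doubled product + rounding */
    int32_t hi = (int32_t)(acc >> 16);         /* extract ACC 31..16 */
    if (hi > 32767)  hi = 32767;               /* signed clamping */
    if (hi < -32768) hi = -32768;
    return (int16_t)hi;
}
```

Note the classic RSP quirk this model reproduces: multiplying -1.0 by -1.0 clamps to <code>0x7FFF</code>, because +1.0 is not representable in S1.15.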
|}

Instructions have this general format:

<code>VINSN vd, vs, vt, e(…)</code>

where <code>e(…)</code> is the [[Reality Signal Processor/CPU Core#Broadcast modifier|broadcast modifier]] (as found in other SIMD architectures), which modifies the access to <code>vt</code>, duplicating some lanes and hiding others. See the Computational instructions section for details.

This is the list of opcodes in this group:
 
{| class="wikitable"
!Opcode
!Instruction
!Description
|-
|0x00
|<code>vlt</code>
|Select the lower value between two VPRs
|-
|0x01
|<code>veq</code>
|Compare two VPRs to check if they are equal
|-
|0x02
|<code>vne</code>
|Compare two VPRs to check if they are different
|-
|0x03
|<code>vge</code>
|Select the greater or equal value between two VPRs
|-
|0x04
|<code>vcl</code>
|Clip a VPR against two bounds (lower 16 bits)
|-
|0x05
|<code>vch</code>
|Clip a VPR against two bounds (higher 16 bits)
|-
|0x06
|<code>vcr</code>
|Clip a VPR against a power-of-2 bound
|-
|0x07
|<code>vmrg</code>
|Merge two VPRs, selecting each lane according to flags
|}

==== VU/SU Moves ====
{| class="wikitable"
!31..26
!25..21
!20..16
!15..11
!10..8
!7..0
|-
|<code>COP2</code>
|<code>opcode</code>
|<code>rt</code>
|<code>vs</code>
|<code>vs_elem</code>
|0
|}
These are the standard MIPS opcodes for moving data in and out of the coprocessor registers:
{| class="wikitable"
!<code>opcode</code>
!Instruction
!Description
|-
|0x0
|<code>mfc2</code>
|Copy a lane of a VPR into a GPR
|-
|0x2
|<code>cfc2</code>
|Copy a VU control register into a GPR
|-
|0x4
|<code>mtc2</code>
|Copy a GPR into a lane of a VPR
|-
|0x6
|<code>ctc2</code>
|Copy a GPR into a VU control register
|}
Vector moves follow the same format as standard MIPS coprocessor moves, but use part of the lower 11 bits (which are normally unused) to specify which lane of the VPR is accessed. Notice that <code>vs_elem</code> specifies a byte offset (not a lane index!), so to copy a lane, <code>lane*2</code> must be specified.

This is an example using GNU syntax:<syntaxhighlight lang="asm">
mtc2 a1, $v04,e(4)
</syntaxhighlight>This example copies the lower 16 bits of GPR <code>a1</code> into the fifth lane of <code>$v04</code>.

<code>mtc2</code> moves the lower 16 bits of the general purpose register <code>rt</code> to the bytes <code>VS[vs_elem+1..vs_elem]</code>.

<code>mfc2</code> moves the 2 bytes <code>VS[vs_elem+1..vs_elem]</code> to GPR <code>rt</code>, sign-extending the 16-bit value to 32 bits.

<code>ctc2</code> moves the lower 16 bits of GPR <code>rt</code> into the control register specified by <code>vs</code>, while <code>cfc2</code> does the reverse, moving the control register specified by <code>vs</code> into GPR <code>rt</code>, sign-extending to 32 bits. Note that both <code>ctc2</code> and <code>cfc2</code> ignore the <code>vs_elem</code> field. For these instructions, the control register is specified as follows:
{| class="wikitable"
!<code>vs</code>
!Register
|-
|0
|<code>VCO</code>
|-
|1
|<code>VCC</code>
|-
|2
|<code>VCE</code>
|}
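The byte-offset behavior of <code>vs_elem</code> can be modeled in a few lines of C. This is a hedged sketch (the helper name is ours, and the model assumes <code>vs_elem < 15</code> so both bytes fit in the register):

```c
#include <stdint.h>

/* Model of mtc2's byte placement: the low 16 bits of the GPR land in VPR
   bytes vs_elem and vs_elem+1 (the N64 is big-endian, so the high byte goes
   to the lower byte index). Sketch; assumes vs_elem < 15. */
static void mtc2_model(uint8_t vpr[16], uint32_t rt, unsigned vs_elem) {
    vpr[vs_elem]     = (uint8_t)(rt >> 8); /* high byte of the 16-bit value */
    vpr[vs_elem + 1] = (uint8_t)rt;        /* low byte */
}
```

To copy lane 4 (the fifth lane), <code>vs_elem</code> must be 8 (<code>lane*2</code>), as noted above.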

=== Instruction details ===

==== Scalar loads: LBV, LSV, LLV, LDV ====
{| class="wikitable"
!31..26
!25..21
!20..16
!15..11
!10..7
!6..0
|-
|<code>LWC2</code>
|<code>base</code>
|<code>vt</code>
|<code>opcode</code>
|<code>element</code>
|<code>offset</code>
|}

===== Assembly =====
<syntaxhighlight lang="asm">
lsv $v01,e(2), 0,s0 ; Load the 16-bit word at s0 into the third lane of $v01
lbv $v04,8, 0,s1 ; Load the 8-bit word at s1 into the 9th byte of $v04 (MSB of lane 4)
</syntaxhighlight>Notice that it is possible to use the lane syntax for the <code>element</code> field to refer to a specific lane, but if the access is made using <code>llv</code> or <code>ldv</code> (4 or 8 bytes), it will overflow into the following lanes.

===== Pseudo-code =====
<syntaxhighlight lang="c">
addr = GPR[base] + offset * access_size
data = DMEM[addr..addr+access_size-1]
VPR[vt][element..element+access_size-1] = data
</syntaxhighlight>

===== Description =====
These instructions load a scalar value (1, 2, 4, or 8 bytes) from DMEM into a VPR. Loads affect only a portion of the vector register (which is 128-bit); other bytes in the register are not modified.

The address in DMEM where the value is fetched is computed as <code>GPR[base] + (offset * access_size)</code>, where <code>access_size</code> is the number of bytes being accessed (eg: 4 for <code>llv</code>). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is <code>VPR[vt][element..element+access_size-1]</code>, that is, <code>element</code> selects the first accessed byte within the vector register. When <code>element+access_size</code> is bigger than 16, fewer bytes are processed (eg: <code>llv</code> with <code>element=13</code> only loads 3 bytes from memory, into <code>VPR[vt][13..15]</code>).

===== Usage =====
These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Input data should already be provided in vectorized format by the CPU, so that it is possible to use a vector load (<code>lqv</code>, in case the input is made of 16-bit data) or a packed load (<code>luv</code>/<code>lpv</code>, in case the input is made of 8-bit data). Consider also using <code>mtc2</code> to load a 16-bit value into a lane of a VPR when the value is available in a GPR.

A possible use-case for these instructions is to reverse the order of the lanes. For instance, in audio codecs, windowing algorithms often work by combining sequences of audio samples with other sequences in reverse order. The RSP does not have an instruction to reverse the order of the lanes, so in that case it might be necessary to reverse the lanes manually while loading, using <code>lsv</code>:<syntaxhighlight lang="asm">
lqv $v00, 0,s0 ; Load 8 16-bit samples from DMEM at address s0
lsv $v01,e(7), 0,s1 ; Load 8 16-bit samples from DMEM at address s1 in reverse order
lsv $v01,e(6), 2,s1
lsv $v01,e(5), 4,s1
lsv $v01,e(4), 6,s1
lsv $v01,e(3), 8,s1
lsv $v01,e(2), 10,s1
lsv $v01,e(1), 12,s1
lsv $v01,e(0), 14,s1
</syntaxhighlight>
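The truncation behavior of scalar loads described above can be captured in a short C model (hedged sketch; the helper name is ours, and DMEM is modeled as a plain byte array):

```c
#include <stdint.h>

/* Model of a scalar load (lbv/lsv/llv/ldv): bytes that would fall past
   byte 15 of the register are simply not loaded (the access truncates).
   Sketch; names are ours, not official RSP code. */
static void scalar_load_model(uint8_t vpr[16], const uint8_t *dmem,
                              uint32_t addr, unsigned element,
                              unsigned access_size) {
    for (unsigned i = 0; i < access_size && element + i < 16; i++)
        vpr[element + i] = dmem[addr + i];
}
```

With <code>element=13</code> and <code>access_size=4</code> (an <code>llv</code>), only bytes 13..15 of the register are written, as in the example above.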

==== Scalar stores: SBV, SSV, SLV, SDV ====
{| class="wikitable"
!31..26
!25..21
!20..16
!15..11
!10..7
!6..0
|-
|<code>SWC2</code>
|<code>base</code>
|<code>vt</code>
|<code>opcode</code>
|<code>element</code>
|<code>offset</code>
|}

===== Assembly =====
<syntaxhighlight lang="asm">
ssv $v01,e(2), 0,s0 ; Store the 16-bit word in the third lane of $v01 into DMEM at address s0
sbv $v04,8, 0,s1 ; Store the 8-bit word in the 9th byte of $v04 (MSB of lane 4) into DMEM at address s1
</syntaxhighlight>Notice that it is possible to use the lane syntax for the <code>element</code> field to refer to a specific lane, but if the access is made using <code>slv</code> or <code>sdv</code> (4 or 8 bytes), it will overflow into the following lanes.

===== Pseudo-code =====
<syntaxhighlight lang="c">
addr = GPR[base] + offset * access_size
data = VPR[vt][element..element+access_size-1]
DMEM[addr..addr+access_size-1] = data
</syntaxhighlight>

===== Description =====
These instructions store a scalar value (1, 2, 4, or 8 bytes) from a VPR into DMEM.

The address in DMEM where the value will be stored is computed as <code>GPR[base] + (offset * access_size)</code>, where <code>access_size</code> is the number of bytes being accessed (eg: 4 for <code>slv</code>). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is <code>VPR[vt][element..element+access_size-1]</code>, that is, <code>element</code> selects the first accessed byte within the vector register. When <code>element+access_size</code> is bigger than 16, the element access wraps within the vector and a full-size store is always performed (eg: <code>slv</code> with <code>element=15</code> stores <code>VPR[vt][15,0..2]</code> into memory, for a total of 4 bytes).

===== Usage =====
These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Data flow between the RSP and the VR4300 should be structured in vectorized format, so that it is possible to use a vector store (<code>sqv</code>, in case the output is made of 16-bit data) or a packed store (<code>suv</code>/<code>spv</code>, in case the output is made of 8-bit data). Consider also using <code>mfc2</code> to copy a 16-bit value from a lane of a VPR into a GPR.

See [[Reality Signal Processor/CPU Core#Scalar loads: LBV, LSV, LLV, LDV|scalar loads]] for an example of a use-case (reversing a vector) that can also be implemented via <code>ssv</code>.
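In contrast with scalar loads, which truncate at the end of the register, scalar stores wrap around. The wrapping can be sketched like this (hedged model; the helper name is ours):

```c
#include <stdint.h>

/* Model of a scalar store (sbv/ssv/slv/sdv): the element index wraps
   around inside the 16-byte register, so the store is always full-size.
   Sketch; names are ours, not official RSP code. */
static void scalar_store_model(uint8_t *dmem, const uint8_t vpr[16],
                               uint32_t addr, unsigned element,
                               unsigned access_size) {
    for (unsigned i = 0; i < access_size; i++)
        dmem[addr + i] = vpr[(element + i) & 15]; /* wrap within the vector */
}
```

With <code>element=15</code> and <code>access_size=4</code> (an <code>slv</code>), bytes 15, 0, 1, 2 of the register are stored, matching the example above.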

==== 128-bit vector loads: LQV, LRV ====
{| class="wikitable"
!31..26
!25..21
!20..16
!15..11
!10..7
!6..0
|-
|<code>LWC2</code>
|<code>base</code>
|<code>vt</code>
|<code>opcode</code>
|<code>element</code>
|<code>offset</code>
|}
{| class="wikitable"
!Insn
!<code>opcode</code>
!Description
|-
|<code>LQV</code>
|0x04
|Load (up to) 16 bytes into a vector, left-aligned
|-
|<code>LRV</code>
|0x05
|Load (up to) 16 bytes into a vector, right-aligned
|}

===== Assembly =====
<syntaxhighlight lang="asm">
// Standard 128-bit load from DMEM aligned address s0 into $v08
lqv $v08, 0,s0

// Loading a misaligned 128-bit vector from DMEM
// (a0 is 128-bit aligned in this example)
lqv $v00, 0x08,a0 // read bytes 0x08(a0) - 0x0F(a0) into left part of the vector (VPR[vt][0..7])
lrv $v00, 0x18,a0 // read bytes 0x10(a0) - 0x17(a0) into right part of the vector (VPR[vt][8..15])

// Advanced example using the "element" field
lqv $v08,e(2), 0x08,a0 // read bytes 0x08(a0) - 0x0F(a0) into VPR[vt][4..11]
lrv $v08,e(2), 0x18,a0 // read bytes 0x10(a0) - 0x13(a0) into VPR[vt][12..15]
</syntaxhighlight>Notice that the element field is optional (it defaults to 0) and is usually not specified, because these instructions are meant to affect the whole vector. The element field can be specified using the lane syntax (<code>e(N)</code>) or as a raw number which maps to the byte offset inside the vector.

===== Pseudo-code =====
<syntaxhighlight lang="c">
// lqv
addr = GPR[base] + (offset * 16)
end = addr | 15
size = MIN(end-addr, 15-element)
VPR[vt][element..element+size] = DMEM[addr..addr+size]
</syntaxhighlight><syntaxhighlight lang="c">
// lrv
end = GPR[base] + (offset * 16)
addr = end & ~15
size = (end-addr-1) - element
VPR[vt][15-size..15] = DMEM[addr..addr+size]
</syntaxhighlight>

===== Description =====
Roughly, these instructions behave like <code>lwl</code> and <code>lwr</code>: combined, they allow reading 128 bits of data into a vector register, irrespective of the alignment.

When the data to be loaded is 128-bit aligned within DMEM, <code>lqv</code> is sufficient to read the whole vector (<code>lrv</code> in this case is redundant, because it becomes a no-op).

The actual bytes accessed in DMEM depend on the instruction: for <code>lqv</code>, the bytes are those starting at <code>GPR[base] + (offset * 16)</code>, up to and excluding the next 128-bit aligned byte (<code>a0+0x10</code> in the above example); for <code>lrv</code>, the bytes are those starting at the previous 128-bit aligned byte (<code>a0+0x10</code> in the above example) up to and ''excluding'' <code>GPR[base] + (offset * 16)</code>. Again, this is exactly the same behavior as <code>lwl</code> and <code>lwr</code>, but for 128-bit aligned loads.

<code>element</code> is used as a byte offset within the vector register to specify the first byte affected by the operation; that is, the part of the vector being loaded with the instruction pair is <code>VPR[vt][element..15]</code>. Thus a non-zero element means that fewer bytes are loaded.

===== Usage =====
<code>lqv</code> is the most standard way to fill a full VPR vector register loading its contents from DMEM. Given that it is usually possible to define the layout of data in DMEM, it is advisable to design it so that vectors are always aligned to 128 bits (16 bytes), using the <code>.align 4</code> directive: this allows reading the vector using just <code>lqv</code>, in 1 cycle (though the load has a 3-cycle latency, like all instructions that write to a VPR).<syntaxhighlight lang="asm">
.data

.align 4
CONST: .half 3, 2, 7, 0, 0x4000, 0x8000, 0x7F, 0xFFF # Several constants used for an algorithm

.text

lqv $v31, %lo(CONST),r0 # Load the constants
</syntaxhighlight>One example of using <code>lqv</code> and <code>lrv</code> as a pair is to perform a fast memcpy from a possibly misaligned address to an aligned destination buffer:<syntaxhighlight lang="asm">
# s0 is the pointer to source data in DMEM (possibly misaligned)
# s4 will point to the destination buffer in DMEM (aligned)
# t1 is the number of bytes to copy
li s4, %lo(BUFFER)
loop:
lqv $v00, 0x00,s0
lrv $v00, 0x10,s0
sqv $v00, 0,s4
sub t1, 16
add s0, 16
bgtz t1, loop
add s4, 16
</syntaxhighlight>

Latest revision as of 21:22, 5 April 2022

Scalar unit (SU)

The scalar unit is the half of the RSP core that is similar to a standard MIPS R4000 32-bit CPU. It has 32 32-bit registers (conventionally called r0-r31) and implements most standard opcodes. This page does not describe the whole scalar unit, as standard MIPS documentation suffices; instead, it highlights the main differences.

Missing opcodes

The following opcodes are not implemented by RSP:

  • Multiplications and divisions. The RSP does not have a multiplication/division unit, so there is no MULT, MULTU, DIV, DIVU, MFHI, MFLO, MTHI, or MTLO.
  • 64-bit instructions. The RSP only has 32-bit scalar registers in the SU, so there are no 64-bit opcodes (the ones starting with D such as DADDIU, DSRL, etc.) nor 64-bit memory accesses such as LD, SD, LDL, SDL.
  • No opcodes for misaligned memory accesses. All memory accesses to DMEM can be performed correctly even at misaligned addresses, using the standard opcodes like LW / SW or LH / LHU / SH, so there is no LWL, LWR, SWL, SWR.
  • No traps or exceptions. The RSP does not implement any form of interrupt or exception handling, so there is no SYSCALL nor trap instructions (TGE, TLT, etc.). BREAK is available but has a special behavior (see below).
  • No support for likely branches. The "likely" variant of all branches is not supported. The missing opcodes are the ones ending with L (such as BEQL, BLEZL, etc.)

Memory access

The RSP is a Harvard architecture: all opcodes are fetched from IMEM (4 KB) and all data is accessed in DMEM (4 KB).

The PC register is 12-bit. All higher address bits in branch / call instructions are thus ignored. When PC reaches the last opcode (at 0xFFC), execution continues to the first opcode in IMEM (PC wraps to 0x000).

All accesses to DMEM are performed using the lowest 12 bits of the address calculated by the load/store instruction (higher bits are ignored). Moreover, contrary to the standard MIPS architecture, the RSP can correctly perform misaligned memory accesses (eg: it is possible to fetch a 32-bit word at address 0x001, which will contain the 4 bytes at 0x001-0x004). The standard MIPS architecture allows misaligned accesses only via the LWL/LWR or SWL/SWR pairs, which are not required on the RSP.
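For example, assuming s0 holds a 16-byte aligned DMEM address:

```asm
lw t0, 1(s0)    # loads the 4 bytes at s0+1 .. s0+4 in a single
                # instruction; no lwl/lwr pair is needed on the RSP
```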

Vector Unit (VU)

The VU is the internal unit of the RSP CPU core that is able to perform fixed-point SIMD calculations. It is a proprietary design which does not follow any standard specification. Its opcodes and registers are exposed to the core via the COP2 interface.

Vector registers and glossary

The VU contains 32 128-bit SIMD registers, each organized as 8 lanes of 16 bits each. Most VU opcodes perform the same operation in parallel on each of the 8 lanes. The arrangement is thus similar to x86 SSE2 registers in EPI16 format.

The vector registers array is called VPR in this document, so VPR[4] refers to the fifth register (usually called v4 in assembly). When referring to specific portions of the register, we use the following convention:

  • VPR[vt][4..7] refers to byte indices, that is bytes from 4 to 7, counting from the higher part of the register (in big-endian order).
  • VPR[vt]<4..7> refers to specific lane indices, that is lanes from 4 to 7 counting from the higher part of the register (in big-endian order).
  • Within each lane, VPR[vt]<2>(3..0) refers to inclusive bit ranges. Notice that bits are counted as usual in little-endian order (bit 0 is the lowest, bit 15 is the highest), and thus they are written as (high..low).

Ranges are specified using the beg..end inclusive notation (that is, both beg and end are part of the range).

The concatenation of disjoint ranges is written with a comma; for instance, [0..3,8..11] means 8 bytes formed by concatenating the 4 bytes starting at 0 with the 4 bytes starting at 8.

Vector lanes are usually interpreted as fixed-point numbers. As a homebrew programmer, it is useful to understand the meaning of each opcode and its correct usage while writing code, which goes beyond the exact hardware description of how bits are shuffled around. To refer to a fixed-point number, we use the syntax S1.15, where "S" means "signed" (while "U" is "unsigned"), "1" is the number of bits of the integral part, and "15" is the number of bits of the fractional part.
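For example, under this notation:

```
S1.15: value = raw / 2^15    e.g. 0x4000 = 16384 / 32768 = 0.5
U0.16: value = raw / 2^16    e.g. 0x8000 = 32768 / 65536 = 0.5
```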

Accumulator

The RSP contains an 8-lane SIMD accumulator, which is used implicitly by multiplication opcodes. Each of the 8 lanes is 48 bits wide, which allows intermediate results of calculations to be accumulated without the loss of precision that would occur when storing them into a 16-bit lane of a vector register.

It is possible to extract the contents of the accumulator through the VSAR opcode; one call to this opcode can extract a 16-bit portion of each lane and store it into the specified vector register. The three portions are conventionally called ACCUM_LO (bits 15..0 of each lane), ACCUM_MD (bits 31..16 of each lane), and ACCUM_HI (bits 47..32 of each lane).

If you exclude the VSAR instruction, which cuts the accumulator piecewise for extracting it, it is better to think of it as a single register where each lane is 48 bits wide.

Control registers

The VU contains 3 16-bit control registers: VCC, VCO, VCE.

These registers are used as flag registers by several opcodes. As with most flag registers, even though each flag has a general meaning, the flags also tend to be used in mind-twisting ways to obtain a desired result. It does not make much sense to describe them at a general level; instead, each instruction will explain if and how it uses or modifies the control registers.

To read/write the contents of the control registers, the ctc2 / cfc2 instructions can be used.
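A minimal sketch (control registers are selected by index, as listed in the VU/SU moves section below: 0 = VCO, 1 = VCC, 2 = VCE):

```asm
cfc2 t0, 0      # copy VCO into GPR t0
ctc2 zero, 1    # clear all flags in VCC
```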

Clamping

Multiplication opcodes perform a clamping step when extracting the accumulator into a vector register. Notice that each lane of the accumulator is always treated as a signed 48-bit number.

This is the pseudo-code for signed clamping (no surprises):

function clamp_signed(accum)
    if accum < -32768  => return -32768
    if accum > 32767   => return 32767
    return accum

The returned value is thus always within the signed 16-bit range.

This is the pseudo-code for unsigned clamping:

function clamp_unsigned(accum)
    if accum < 0       => return 0
    if accum > 32767   => return 65535
    return accum

Notice that in unsigned clamping, the saturating threshold is 15-bit, but the saturated value is 16-bit.
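A classic consequence of signed clamping, shown here as a sketch with vmulf (register choice arbitrary):

```asm
# If every lane of $v01 and $v02 holds 0x8000 (-1.0 in S1.15),
# the exact product +1.0 does not fit in S1.15, so the signed
# clamping step returns 0x7FFF (about 0.99997) in every lane of $v03.
vmulf $v03, $v01, $v02
```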

Broadcast modifier

Some families of VU instructions (specifically, the computational instructions and the select instructions) allow applying a "broadcast modifier" to one of the input registers. For instance, in this instruction
vaddc $v01, $v04, e(1)
e(1) is the broadcast modifier. Normally, the instruction would add the two registers lane by lane; with the modifier, the second lane (index 1) of $v04 is added to all lanes of $v01.

The modifier is stored in the element field of the opcode

element GNU syntax SGI syntax Lanes being accessed Description
0 0,1,2,3,4,5,6,7 Normal register access (no broadcast)
1 0,1,2,3,4,5,6,7 Normal register access (no broadcast)
2 e(0q) [0q] 0,0,2,2,4,4,6,6 Broadcast 4 of 8 lanes
3 e(1q) [1q] 1,1,3,3,5,5,7,7 Broadcast 4 of 8 lanes
4 e(0h) [0h] 0,0,0,0,4,4,4,4 Broadcast 2 of 8 lanes
5 e(1h) [1h] 1,1,1,1,5,5,5,5 Broadcast 2 of 8 lanes
6 e(2h) [2h] 2,2,2,2,6,6,6,6 Broadcast 2 of 8 lanes
7 e(3h) [3h] 3,3,3,3,7,7,7,7 Broadcast 2 of 8 lanes
8 e(0) [0] 0,0,0,0,0,0,0,0 Broadcast single lane
9 e(1) [1] 1,1,1,1,1,1,1,1 Broadcast single lane
10 e(2) [2] 2,2,2,2,2,2,2,2 Broadcast single lane
11 e(3) [3] 3,3,3,3,3,3,3,3 Broadcast single lane
12 e(4) [4] 4,4,4,4,4,4,4,4 Broadcast single lane
13 e(5) [5] 5,5,5,5,5,5,5,5 Broadcast single lane
14 e(6) [6] 6,6,6,6,6,6,6,6 Broadcast single lane
15 e(7) [7] 7,7,7,7,7,7,7,7 Broadcast single lane
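As an illustration, a sketch of the modifier in use with the GNU syntax above (register choices are arbitrary):

```asm
# Scale every lane of $v01 by the single value held in lane 0 of $v02:
vmulf $v03, $v01, $v02, e(0)

# Half-broadcast: lanes 0 and 4 of $v02 are each applied to a group
# of four lanes of $v01 (access pattern 0,0,0,0,4,4,4,4):
vaddc $v03, $v01, $v02, e(0h)
```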

Instructions overview

Loads and stores

31..26 25..21 20..16 15..11 10..7 6..0
LWC2 or SWC2 base vt opcode element offset

The instructions perform a load/store from DMEM into/from a vector register.

  • base is the index of a scalar register used as base for the memory access
  • offset is a signed offset added to the value of the base register (with some scaling, depending on the actual instruction).
  • vt is the vector register.
  • element is used to index a specific byte/word within the vector register, usually specifying the first element affected by the operation (thus allowing access to sub-portions of the vector register).
List of all loads and stores opcodes
Group Opcode Instruction Description
Scalar 0x00 lbv / sbv Load / Store 1 byte into/from a VPR
0x01 lsv / ssv Load / Store 2 bytes into/from a VPR
0x02 llv / slv Load / Store 4 bytes into/from a VPR
0x03 ldv / sdv Load / Store 8 bytes into/from a VPR
128-bit 0x04 lqv Load (up to) 16 bytes into a VPR, left-aligned
0x05 lrv Load (up to) 16 bytes into a VPR, right-aligned
0x04 sqv Store (up to) 16 bytes from a VPR, left-aligned
0x05 srv Store (up to) 16 bytes from a VPR, right-aligned
8-bit packed 0x06 lpv / spv Load / store 8 8-bit signed values into a VPR
0x07 luv / suv Load / store 8 8-bit unsigned values into a VPR
0x08 lhv / shv Load / store 8 8-bit unsigned values into VPR, accessing every other byte in memory
0x09 lfv / sfv Load / store 4 8-bit unsigned values into VPR, accessing every fourth byte in memory
Transpose 0x0A swv Store 16 bytes from a VPR into DMEM, rotated
0x0B ltv Load 8 lanes, one into each of 8 consecutive VPRs (transposed)
0x0B stv Store 8 lanes, one from each of 8 consecutive VPRs (transposed)

Single-lane instructions

31..26 25 24..21 20..16 15..11 10..6 5..0
COP2 1 vt_elem vt vd_elem vd opcode

Single-lane instructions are an instruction group that performs operations on a single lane of a single input register (VT<se>), and stores the result into a single lane of a single output register (VD<de>).

Example syntax:
vmov $v01, e(4), $v05, e(6)
In this example, the value in lane $v05<6> is moved to lane $v01<4>. In the assembly syntax, the broadcast modifier syntax is used, but no actual broadcast is performed, as the instructions operate on the single specified lane. Only the single-lane broadcast modifiers (e(0) ... e(7)) are supported.

In the opcode, the fields vt_elem and vd_elem are used to compute se and de that is to specify which lane, respectively of the input and output register, is affected.

vd_elem is 5 bits long (range 0..31); the highest bits are always ignored, and the destination lane de is simply vd_elem(2..0).

vt_elem is 4 bits long (range 0..15). When vt_elem(3) is 1, vt_elem(2..0) is actually used as source lane se, as expected. When vt_elem(3) is 0, a hardware bug is triggered and portions of the lower bits of vt_elem are replaced with portion of the bits of vd_elem while computing se. Specifically, all bits in vt_elem from the topmost set bit and higher are replaced with the same-position bits in vd_elem. Notice that this behaviour is actually consistent with what happens when vt_elem(3) is 1, which means that there is no need to think of it as a special-case. Pseudo-code:

de(2..0) = vd_elem(2..0)
msb = highest_set_bit(vt_elem)
se(2..0) = vd_elem(2..msb) || vt_elem(msb-1..0)
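A worked instance of the pseudo-code above:

```
vt_elem = 0b0010, vd_elem = 0b01101
msb = 1                                   # topmost set bit of vt_elem
de  = vd_elem(2..0)                  = 0b101 = 5
se  = vd_elem(2..1) || vt_elem(0..0) = 0b100 = 4
```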
Single-lane instructions
Opcode Instruction Description
0x33 vmov Copy one lane of a VPR into another VPR
0x30 vrcp Compute the 32-bit reciprocal of a 16-bit fixed point
0x34 vrsq Compute the 32-bit reciprocal square root of a 16-bit fixed point
0x32 vrcph Extract the higher 16-bit of the result of a previous VRCP
0x36 vrsqh Extract the higher 16-bit of the result of a previous VRSQ
0x31 vrcpl Compute the 32-bit reciprocal of a 32-bit fixed point
0x35 vrsql Compute the 32-bit reciprocal square root of a 32-bit fixed point
0x37 vnop No operation (?)
0x3F vnull No operation (?)

Computational instructions

31..26 25 24..21 20..16 15..11 10..6 5..0
COP2 1 element vt vs vd opcode

Instructions have this general format:

VINSN vd, vs, vt, e(…)

where e(…) is the broadcast modifier (as found in other SIMD architectures), which modifies the access to vt by duplicating some lanes and hiding others.

This is the list of opcodes in this group.

Opcode Instruction Description
0x00 vmulf Vector multiply S1.15 * S1.15, with rounding and signed clamping
0x01 vmulu Vector multiply S1.15 * S1.15 with rounding and unsigned clamping
0x04 vmudl Vector multiply U0.16 * U0.16 with signed clamping
0x05 vmudm Vector multiply S0.16 * U0.16 with signed clamping
0x06 vmudn Vector multiply U0.16 * S0.16 with signed clamping
0x07 vmudh Vector multiply S0.16 * S0.16 with signed clamping
0x08 vmacf Like VMULF, but also add the result to the accumulator
0x09 vmacu Like VMULU, but also add the result to the accumulator
0x0C vmadl Like VMUDL, but also add the result to the accumulator
0x0D vmadm Like VMUDM, but also add the result to the accumulator
0x0E vmadn Like VMUDN, but also add the result to the accumulator
0x0F vmadh Like VMUDH, but also add the result to the accumulator
0x10 vadd Vector add with carry
0x13 vabs Vector absolute value
0x14 vaddc Vector add writing overflow into carry
0x1D vsar Read a portion of the accumulator into a VPR
0x28 vand Vector bitwise and (a & b)
0x29 vnand Vector bitwise nand (~(a & b))
0x2A vor Vector bitwise or (a | b)
0x2B vnor Vector bitwise nor (~(a | b))
0x2C vxor Vector bitwise xor (a ^ b)
0x2D vnxor Vector bitwise nxor (~(a ^ b))
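As an example of how the multiply-accumulate opcodes chain through the 48-bit accumulator, here is a sketch of the well-known four-instruction sequence for a full 32-bit multiplication; each lane holds one 32-bit operand split across two VPRs (high and low 16 bits), and the register allocation is arbitrary:

```asm
# $v01:$v02 = a (hi:lo),  $v03:$v04 = b (hi:lo),  $v07 = scratch
vmudl $v07, $v02, $v04   # accum  = (a_lo * b_lo) >> 16
vmadm $v07, $v01, $v04   # accum += a_hi * b_lo
vmadn $v06, $v02, $v03   # accum += a_lo * b_hi;         $v06 = low 16 bits
vmadh $v05, $v01, $v03   # accum += (a_hi * b_hi) << 16; $v05 = high 16 bits
```

The 32-bit result, in the same hi:lo split, ends up in $v05:$v06.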

Select instructions

31..26 25 24..21 20..16 15..11 10..6 5..0
COP2 1 element vt vs vd opcode

Instructions have this general format:

VINSN vd, vs, vt, e(…)

where e(…) is the broadcast modifier (as found in other SIMD architectures), which modifies the access to vt by duplicating some lanes and hiding others. See the Computational instructions section for details.

This is the list of opcodes in this group:

Opcode Instruction Description
0x00 vlt Select the lower value between two VPR
0x01 veq Compare two VPR to check if they are equal
0x02 vne Compare two VPR to check if they are different
0x03 vge Select the greater or equal value between two VPR
0x04 vcl Clip a VPR against two bounds (lower 16-bits)
0x05 vch Clip a VPR against two bounds (higher 16-bits)
0x06 vcr Clip a VPR against a pow-2 bound
0x07 vmrg Merge two VPR selecting each lane according to flags
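A sketch of how two of these pair up in practice (the precise flag semantics are described in each instruction's details; register choices are arbitrary):

```asm
vlt  $v01, $v02, $v03   # per lane: $v01 = min($v02, $v03); VCC records the outcome
vmrg $v04, $v05, $v06   # per lane: select $v05 or $v06 according to VCC
```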

VU/SU Moves

31..26 25..21 20..16 15..11 10..7 6..0
COP2 opcode rt vs vs_elem 0

These are the standard MIPS opcodes for moving data in and out of the coprocessor registers.

opcode Instruction Description
0x0 mfc2 Copy a lane of a VPR into a GPR
0x2 cfc2 Copy a VU control register into a GPR
0x4 mtc2 Copy a GPR into a lane of a VPR
0x6 ctc2 Copy a GPR into a VU control register

Vector moves follow the same format as standard MIPS coprocessor moves, but use part of the lower 11 bits (which are normally unused) to specify which lane of the VPR is accessed. Notice that vs_elem specifies a byte offset (not a lane index!), so to copy a lane, lane*2 must be specified.

This is an example using GNU syntax:
mtc2 a1, $v04,e(4)
This example will copy the lower 16 bits of GPR a1 into the fifth lane of $v04.

mtc2 moves the lower 16 bits of the general purpose register rt to the bytes VS[vs_elem+1..vs_elem].

mfc2 moves the 2 bytes VS[vs_elem+1..vs_elem] to GPR rt, sign-extending the 16-bit value to 32 bits.

ctc2 moves the lower 16 bits of GPR rt into the control register specified by vs, while cfc2 does the reverse, moving the control register specified by vs into GPR rt, sign-extending to 32 bits. Note that both ctc2 and cfc2 ignore the vs_elem field. For these instructions, the control register is specified as follows:

vs Register
0 VCO
1 VCC
2 VCE

Instruction details

Scalar loads: LBV, LSV, LLV, LDV

31..26 25..21 20..16 15..11 10..7 6..0
LWC2 base vt opcode element offset
Assembly
lsv $v01,e(2), 0,s0    ; Load the 16-bit word at s0 into the third lane of $v01
lbv $v04,8, 0,s1       ; Load the 8-bit word at s1 into the 9th byte of $v04 (MSB of lane 4)
Notice that it is possible to specify the lane syntax for the element field to refer to a specific lane, but if the access is made using llv or ldv (4 or 8 bytes), it will overflow into the following lanes.
Pseudo-code
addr = GPR[base] + offset * access_size
data = DMEM[addr..addr+access_size-1]
VPR[vt][element..element+access_size-1] = data
Description

These instructions load a scalar value (1, 2, 4, or 8 bytes) from DMEM into a VPR. Loads affect only a portion of the vector register (which is 128-bit); other bytes in the register are not modified.

The address in DMEM where the value is fetched is computed as GPR[base] + (offset * access_size), where access_size is the number of bytes being accessed (eg: 4 for llv). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is VPR[vt][element..element+access_size-1]; that is, element selects the first accessed byte within the vector register. When element+access_size is bigger than 16, fewer bytes are processed (eg: llv with element=13 only loads 3 bytes from memory into VPR[vt][13..15]).

Usage

These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Input data should already be provided in vectorized format by the CPU, so that it is possible to use a vector load (lqv, in case the input is made of 16-bit data) or a packed load (luv/lpv, in case the input is made of 8-bit data). Consider also using mtc2 to load a 16-bit value into a lane of a VPR when the value is available in a GPR.

A possible use-case for these instructions is to reverse the order of the lanes. For instance, in audio codecs, windowing algorithms often work by combining sequences of audio samples with other sequences in reverse order. The RSP does not have an instruction to reverse the order of the lanes, so in that case it might be necessary to manually reverse the lanes while loading, using lsv:
lqv $v00, 0,s0         ; Load 8 16-bit samples from DMEM at address s0
lsv $v01,e(7),  0,s1   ; Load 8 16-bit samples from DMEM at address s1 in reverse order
lsv $v01,e(6),  2,s1
lsv $v01,e(5),  4,s1
lsv $v01,e(4),  6,s1
lsv $v01,e(3),  8,s1
lsv $v01,e(2), 10,s1
lsv $v01,e(1), 12,s1
lsv $v01,e(0), 14,s1

Scalar stores: SBV, SSV, SLV, SDV

31..26 25..21 20..16 15..11 10..7 6..0
SWC2 base vt opcode element offset
Assembly
ssv $v01,e(2), 0,s0    ; Store the 16-bit word in the third lane of $v01 into DMEM at address s0
sbv $v04,8, 0,s1       ; Store the 8-bit word in the 9th byte of $v04 (MSB of lane 4) into DMEM at address s1
Notice that it is possible to specify the lane syntax for the element field to refer to a specific lane, but if the access is made using slv or sdv (4 or 8 bytes), it will overflow into the following lanes.
Pseudo-code
addr = GPR[base] + offset * access_size
data = VPR[vt][element..element+access_size-1]
DMEM[addr..addr+access_size-1] = data
Description

These instructions store a scalar value (1, 2, 4, or 8 bytes) from a VPR into DMEM.

The address in DMEM where the value will be stored is computed as GPR[base] + (offset * access_size), where access_size is the number of bytes being accessed (eg: 4 for SLV). The address can be misaligned: despite how memory accesses usually work on MIPS, these instructions perform unaligned memory accesses.

The part of the vector register being accessed is VPR[vt][element..element+access_size-1]; that is, element selects the first accessed byte within the vector register. When element+access_size is bigger than 16, the element access wraps within the vector and a full-size store is always performed (eg: slv with element=15 stores VPR[vt][15,0..2] into memory, for a total of 4 bytes).

Usage

These instructions are seldom used. Normally, it is better to structure RSP code to work across full vectors to maximize parallelism. Data flow between RSP and VR4300 should be structured in vectorized format, so that it is possible to use a vector store (sqv, in case the output is made of 16-bit data) or a packed store (suv/spv, in case the output is made of 8-bit data). Consider also using mfc2 to move a 16-bit value from a lane of a VPR into a GPR.

See scalar loads for an example of a use-case (reversing a vector) that can also be implemented via ssv.

128-bit vector loads: LQV, LRV

31..26 25..21 20..16 15..11 10..7 6..0
LWC2 base vt opcode element offset
Insn opcode Desc
LQV 0x04 load (up to) 16 bytes into vector, left-aligned
LRV 0x05 load (up to) 16 bytes into vector, right-aligned
Assembly
// Standard 128-bit load from DMEM aligned address s0 into $v08
lqv $v08, 0,s0

// Loading a misaligned 128-bit vector from DMEM
// (a0 is 128-bit aligned in this example)
lqv $v00, 0x08,a0    // read bytes 0x08(a0) - 0x0F(a0) into left part of the vector (VPR[vt][0..7])
lrv $v00, 0x18,a0    // read bytes 0x10(a0) - 0x17(a0) into right part of the vector (VPR[vt][8..15])

// Advanced example using the "element" field
lqv $v08,e(2), 0x08,a0   // read bytes 0x08(a0) - 0x0F(a0) into VPR[vt][4..11]
lrv $v08,e(2), 0x18,a0   // read bytes 0x10(a0) - 0x13(a0) into VPR[vt][12..15]
Notice that the element field is optional (defaults to 0) and is usually not specified because these instructions are meant to affect the whole vector. The element field can be specified using the lane syntax (e(N)) or a raw number which maps to the byte offset inside the vector.
Pseudo-code
// lqv
addr = GPR[base] + (offset * 16)
end = addr | 15
size = MIN(end-addr, 15-element)
VPR[vt][element..element+size] = DMEM[addr..addr+size]
// lrv
end = GPR[base] + (offset * 16)
addr = end & ~15
size = end - addr                 // 0 if end is 128-bit aligned (lrv becomes a no-op)
start = 16 - size + element
VPR[vt][start..15] = DMEM[addr..addr+(15-start)]
Description

Roughly, these instructions behave like lwl and lwr: combined, they allow reading 128 bits of data into a vector register, irrespective of the alignment.

When the data to be loaded is 128-bit aligned within DMEM, lqv is sufficient to read the whole vector (lrv in this case is redundant because it becomes a no-op).

The actual bytes accessed in DMEM depend on the instruction: for lqv, the bytes are those starting at GPR[base] + (offset * 16), up to and excluding the next 128-bit aligned byte (a0+0x10 in the above example); for lrv, the bytes are those starting at the previous 128-bit aligned byte (a0+0x10 in the above example) up to and excluding GPR[base] + (offset * 16). Again, this is exactly the same behavior as lwl and lwr, but for 128-bit loads.

element is used as a byte offset within the vector register to specify the first byte affected by the operation; that is, the part of the vector being loaded with the instruction pair is VPR[vt][element..15]. Thus a non-zero element means that fewer bytes are loaded.

Usage
lqv is the most standard way to fill a full VPR vector register with contents loaded from DMEM. Given that it is usually possible to define the layout of data in DMEM, it is advisable to design it so that vectors are always aligned to 128 bits (16 bytes), using the .align 4 directive: this allows the vector to be read using just lqv, in 1 cycle (though the load has a 3-cycle latency, like all instructions that write to a VPR).
    .data

    .align 4
CONST:  .half 3, 2, 7, 0, 0x4000, 0x8000, 0x7F, 0xFFF      # Several constants used for an algorithm

    .text
    
    lqv $v31, %lo(CONST),r0    # Load the constants
One example of using lqv and lrv as a pair is to perform a fast memcpy from a possibly misaligned address to an aligned destination buffer:
    # s0 is the pointer to source data in DMEM (possibly misaligned)
    # s4 will point to the destination buffer in DMEM (aligned)
    # t1 is the number of bytes to copy
    li s4, %lo(BUFFER)
loop:
    lqv $v00, 0x00,s0
    lrv $v00, 0x10,s0
    sqv $v00, 0,s4
    sub t1, 16
    add s0, 16
    bgtz t1, loop
    add s4, 16