Peripheral Interface: Difference between revisions

Full description of PI DMA transfer process
=Aligned DMA Transfers=
PI DMA is well defined for so-called "aligned transfers", which are defined by the following constraints:

# RDRAM address must be 8-byte aligned
# PI address must be 2-byte aligned
# Length must be a multiple of 2

Notice that the second point might be considered redundant from a hardware point of view, given that both address registers are fixed to be 2-byte aligned (the LSB is fixed to 0), but from a software point of view it has to be taken into account.

In register terms, an aligned DMA transfer is one where PI_DRAM_ADDR_REG is set to a 64-bit (8-byte) aligned address. PI_CART_ADDR_REG can be any 16-bit (2-byte) aligned value, as the transfer will start from that offset. PI_RD_LEN_REG and PI_WR_LEN_REG can hold any length, as long as it is a 2-byte aligned amount (more testing is needed to confirm this).
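The three constraints above can be expressed as a small check. This is only a sketch: the function name <code>is_aligned_pi_dma</code> is hypothetical, and it assumes the length is given as a byte count (the example values in this section, such as 0x7F for 128 bytes, suggest that the length registers hold the byte count minus one).

```python
def is_aligned_pi_dma(dram_addr: int, cart_addr: int, length: int) -> bool:
    """Return True if a transfer satisfies the three "aligned transfer" constraints.

    length is the number of bytes to transfer (assumed to be the value
    written to the length register plus one).
    """
    rdram_ok = dram_addr % 8 == 0   # RDRAM address must be 8-byte aligned
    pi_ok = cart_addr % 2 == 0      # PI address must be 2-byte aligned
    length_ok = length % 2 == 0     # length must be a multiple of 2
    return rdram_ok and pi_ok and length_ok


# A 128-byte transfer to RDRAM 0x100 is aligned; the same transfer to 0x106 is not.
print(is_aligned_pi_dma(0x0000_0100, 0x1000_1000, 128))
print(is_aligned_pi_dma(0x0000_0106, 0x1000_1000, 128))
```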
= Unaligned DMA transfer =
An unaligned ROM DMA transfer is one where PI_DRAM_ADDR_REG is not set to an 8-byte aligned address, combined with arbitrary PI_RD_LEN_REG and PI_WR_LEN_REG lengths.

The behavior of PI DMA when the first and third constraints are not respected is not well designed; it seems like the designers attempted to implement support for loosening these constraints but gave up in the middle, leaving the hardware in a state that can only be described as "buggy". This also leaks some internal details on how the transfers are performed.

The following rules are based on assumptions drawn from the test ROMs created by Krom, Mazamars312 and Lemmy ([https://github.com/PeterLemon/N64/tree/master/CPUTest/DMAAlignment-PI-cart https://github.com/PeterLemon/N64/tree/master/CPUTest/DMAAlignment-PI-cart]).
{| class="wikitable"
|+Example DMA transfers
!RDRAM Address
!ROM Address
!Read or Write
!Length
!What happens
|-
|0000_0100
|1000_1000
|Read
|0x7F (128 Bytes)
|This is a normal aligned transfer.
|-
|0000_0102
|1000_1000
|Read
|0x7F (128 Bytes)
|The start of the ROM data is transferred to the RDRAM offset as expected (so the first 2 bytes of RDRAM are not affected by this write).
However, the last 2 bytes are dropped from the transfer, effectively making it a 0x7D length transfer (126 bytes - 1).
|-
|0000_0106
|1000_1000
|Read
|0x7F (128 Bytes)
|The start of the ROM data is transferred to the RDRAM offset as expected (so the first 6 bytes of RDRAM are not affected by this write).
However, the last 6 bytes are dropped from the transfer, effectively making it a 0x79 length transfer (122 bytes - 1).
|-
|0000_0106
|1000_1000
|Read
|0x17 (24 Bytes)
|The start of the ROM data is transferred to the RDRAM offset as expected (so the first 6 bytes of RDRAM are not affected by this write).
However, the last 6 bytes are dropped from the transfer, effectively making it a 0x11 length transfer (18 bytes - 1).
|-
|0000_0106
|1000_1000
|Read
|0xFF (256 Bytes)
|Here we found that internally the N64 can only DMA blocks of 128 bytes at a time to and from RDRAM, as a burst to the PI controller.
The first 128 bytes:

The start of the ROM data is transferred to the RDRAM offset as expected (so the first 6 bytes of RDRAM are not affected by this write).

However, the last 6 bytes are dropped from the transfer, effectively making it a 0x79 length transfer (122 bytes - 1).

The second 128 bytes:

This does a normal aligned DMA transfer to RDRAM offsets 128 to 255. From this we believe the first DMA transfer is corrupted due to some internal issue between the PI controller and the RDRAM controller.

The image below shows this example (look at addresses 112 -> 127): the last 6 bytes are not transferred. (Confirmed by Krom)[[File:DMA UnAligned 6byte Offset.png|center|thumb|300x300px]]
|-
|0000_0106
|1000_1000
|Write
|0x7F (128 Bytes)
|*** Writes to Flash and SRAM to be tested ***
|}

To implement PI DMA, the RCP uses an internal 128 byte buffer. The following section attempts to describe the exact process (though the ''actual'' process implemented in the hardware is unknown; the following does match the observable behavior).

NOTE: only DMA write transfers (PI -> RDRAM) have been analyzed in detail, using default PI DOM1 settings. It is expected that read transfers (RDRAM -> PI) behave in a specular way, though this has not been fully tested yet. We also expect the PI DOM1 page size setting to somehow affect the transfer, though this has also not been explored yet.

==== Internal process ====
The transfer is split into blocks of at most 128 bytes each. Within each block, the PI first fills the internal buffer by fetching data from the PI bus, and then writes the buffer contents back to RDRAM. This can be observed by monitoring PI_DRAM_ADDR and PI_CART_ADDR: during the transfer, PI_CART_ADDR can first be seen moving forward, and then PI_DRAM_ADDR catches up with a leap (writing to RDRAM is much faster than reading from PI).

In general, for all blocks of the transfer (excluding the first one, see below), the logic appears to be as follows:

* Compute the block size. This is the smallest of the remaining length, the distance to the end of the current RDRAM page, and 128 bytes (which is the maximum size of the internal buffer). RDRAM pages are 2 KiB (0x800) long, so for instance if the current RDRAM address (at the beginning of the block) is 0x147e0, the block size will be 0x20 because the RDRAM page ends at 0x147ff.
* Fill the buffer using PI reads from the bus. All PI accesses are always 16 bits long, so if the block size is odd (which happens on the last block, if the remaining length is odd), one extra byte will be fetched from PI into the internal buffer.
* Write back into RDRAM. The exact format of RDRAM writes is unknown at the moment; since PI DMA transfers are well-defined for 8-byte aligned RDRAM addresses, it is assumed that 64-bit writes are used (a burst like that used for D/I cache writebacks would require 16-byte alignment or more to be performed). If an extra byte was fetched in the previous step, that byte is also written to RDRAM. So in general, odd-length PI DMA transfers will transfer one byte more than requested.

The above logic applies to all blocks of the transfer, excluding the first one. The first block is in fact treated specially by PI. It appears that the goal of the designers was to use the first block to realign transfers to 8 bytes in RDRAM, which possibly causes the first block to use smaller, masked writes to RDRAM. So, even if the RDRAM starting address is misaligned, all blocks besides the first one will begin from an 8-byte aligned RDRAM address, and behave with the logic described above.
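The block-size rule for blocks after the first one can be sketched as follows. This is an illustrative model rather than a description of the actual hardware: the names <code>block_size</code> and <code>general_block_bytes_written</code> are hypothetical, and the code only covers blocks that follow the general logic (not the first block).

```python
RDRAM_PAGE_SIZE = 0x800   # RDRAM pages are 2 KiB long
PI_BUFFER_SIZE = 128      # size of the internal PI DMA buffer


def block_size(dram_addr: int, remaining: int) -> int:
    """Size of the next block: the smallest of the remaining length,
    the distance to the end of the RDRAM page, and the buffer size."""
    to_page_end = RDRAM_PAGE_SIZE - (dram_addr % RDRAM_PAGE_SIZE)
    return min(remaining, to_page_end, PI_BUFFER_SIZE)


def general_block_bytes_written(size: int) -> int:
    """Bytes written to RDRAM by a non-first block of the given size.
    PI accesses are 16-bit, so an odd size fetches (and writes) one extra byte."""
    return size + (size & 1)


# Example from the text: at RDRAM address 0x147e0 the page ends at 0x147ff,
# so the block is limited to 0x20 bytes even if much more data remains.
print(hex(block_size(0x147E0, 0x1000)))
```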
==== Internal process: first block ====
These are the differences in logic while processing the first block, which mostly concern how the initial RDRAM misalignment is handled. In this description, we refer to ''RDRAM misalignment'' as the number of bytes by which the RDRAM address is distant from the previous 8-byte aligned word (that is, the misalignment is the value of the last 3 bits of the RDRAM address). Notice that the RDRAM address hardware register has the LSB fixed to 0, so a nonzero misalignment can only be 2, 4, or 6.
 
* The internal 128 byte buffer is filled starting from the index matching the misalignment. This might affect the maximum size of the first block: for instance, if misalignment is 6, the maximum size is not 128 but 122, because the first 6 bytes are skipped.
* Writes to RDRAM seem to use some kind of masking, so they are correctly done at byte granularity. This means that odd-length transfers in the first block appear to work correctly. Notice that this applies only to the first block, whatever its size is; the size (as described above) might be limited by the end of the RDRAM page, in which case only odd transfers up to that point work correctly.
* As an exception to the above exception, if the first block reaches the end of the 128 byte buffer, the last 2 bytes of the buffer are always written back in full to RDRAM, even though one less byte was requested.
** Example: PI DMA transfer with misalignment 0 and RDRAM page end far away. Odd lengths up to 125 (included) work correctly; odd transfers of exactly 127 bytes are rounded up to 128 (since they reach the last 16-bit word of the buffer). Also odd transfers of 129 or more, since they need two blocks to be performed, fall back into the general rule where one more byte is transferred.
** Example: PI DMA transfer with misalignment 6 and RDRAM page end far away. Odd lengths up to 119 (included) work correctly; odd transfers of exactly 121 bytes are rounded up to 122 (since they reach the last 16-bit word of the buffer). Also odd transfers of 123 or more, since they need two blocks to be performed, fall back into the general rule where one more byte is transferred.
* There seems to be a hardware bug related to how RDRAM writes are performed in case of misaligned addresses. It seems like the hardware counts the block length starting from index 0 of the buffer, even though the first byte was actually placed at the index matching the misalignment, and even though masking is performed correctly. This means that, for instance, if the misalignment is 6 and the length is 8, the following happens:
** First, 8 bytes are fetched from the PI bus and put at index 6..13 in the internal buffer.
** Then, RDRAM writes are performed but the hardware believes the block ends at index 7 (8 bytes counted from index 0), so only bytes 6..7 are written back to RDRAM.
* Symmetrically, if the buffer is full (128 bytes), the last 6 bytes will not be transferred because of the same bug (even if those bytes were fetched by the PI bus). So there will be a "hole" of 6 bytes in the RDRAM output buffer. For instance, if misalignment is 6 and the length is 1024, and the RDRAM page end is far away, the following happens on the first block:
** Block size is computed as 122 bytes.
** 122 bytes are fetched from the PI bus, and put at index 6..127 in the internal buffer.
** RDRAM writes are performed but the hardware believes that the block ends at index 121, so only bytes 6..121 are written back to RDRAM.
** Notice that, this notwithstanding, the RDRAM address is correctly rounded up to 8 bytes at the end of the block (see below), so the second block will behave correctly. There will be a hole in RDRAM, as bytes 122..127 of the first block are never written back to RDRAM, so the content of RDRAM for those bytes is not affected by the DMA.
* The RDRAM address register is always rounded up to the next 8-byte alignment at the end of the first block. In most normal cases, the logic above already ensures that the address ends up being aligned at the end of the block, but the rounding up happens even in cases like short transfers that end within the first block at an arbitrary byte.
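The buffer indexing bug described in the list above can be modelled with a short sketch. It is a simplification under explicit assumptions: the length is even, the RDRAM page end is far away, and the function name <code>first_block_ranges</code> is hypothetical. Ranges are inclusive buffer indices, following the 6..127 and 6..121 convention used above; the case where the believed end falls before the misalignment is an extrapolation, not a documented result.

```python
PI_BUFFER_SIZE = 128  # size of the internal PI DMA buffer


def first_block_ranges(dram_addr: int, length: int):
    """Model the first block of a PI -> RDRAM transfer.

    Assumes an even length and an RDRAM page end far away.
    Returns (block_size, fetched_range, written_range), where the ranges
    are inclusive index pairs into the internal buffer.
    """
    misalign = dram_addr & 0x6  # low 3 bits; the register LSB is fixed to 0
    size = min(length, PI_BUFFER_SIZE - misalign)

    # Data fetched from the PI bus is stored starting at the misalignment.
    fetched = (misalign, misalign + size - 1)

    # Bug: the hardware counts the block length from index 0 of the buffer.
    believed_end = size - 1
    if believed_end < misalign:
        written = None  # assumption: nothing is written back in this case
    else:
        written = (misalign, believed_end)
    return size, fetched, written


# Misalignment 6, length 1024: bytes 122..127 of the buffer are never written back.
print(first_block_ranges(0x0000_0106, 1024))
```

With misalignment 6 and a length of 24 bytes, the same model writes back indices 6..23, that is 18 bytes, which agrees with the 0x17 length row of the example table.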
 
==== Followup transfers ====
After a DMA transfer is finished, it is possible to trigger a "followup transfer", that is, a transfer that sequentially continues the previous one, by simply writing a new length to the PI_WR_LEN register. In this case, the current values of PI_DRAM_ADDR and PI_CART_ADDR are used at the beginning of the transfer. Those values will match the last addresses as updated by the first transfer.
 
The above sections describe in detail how PI reads and RDRAM writes are done and how the registers are updated, so they also implicitly describe how a followup transfer behaves in various edge cases (short transfers, misaligned transfers, etc.).
 
==== PI_WR_LEN readbacks after a transfer ====
After a transfer is done, reading back PI_WR_LEN appears to always return 0x7F. The only exception that has been noticed is when the transfer was smaller than 8 bytes: in that case, the value is 0x7F minus the initial RDRAM misalignment. For instance, if the RDRAM misalignment was 4, the value found in the register at the end of the transfer will be 0x7B.
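For emulator authors, the observed readback rule can be written as a tiny helper. This is a sketch of the observation above only; the function name <code>pi_wr_len_readback</code> is hypothetical, and the length is assumed to be the transfer size in bytes.

```python
def pi_wr_len_readback(length: int, misalign: int) -> int:
    """Value observed when reading PI_WR_LEN after a transfer has completed.

    length is the transfer size in bytes; misalign is the initial RDRAM
    misalignment (0, 2, 4 or 6).
    """
    if length < 8:
        # Short transfers: 0x7F minus the initial RDRAM misalignment.
        return 0x7F - misalign
    return 0x7F


# A 4-byte transfer with misalignment 4 reads back as 0x7B.
print(hex(pi_wr_len_readback(4, 4)))
```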
 
==== DMA data dumps ====
To further investigate and understand how PI DMA is performed, the repo [https://github.com/rasky/n64_pi_dma_test n64_pi_dma_test] can be used. The repo contains data dumps, acquired on real hardware, of DMA transfers with all possible misalignments (0, 2, 4, 6), all lengths from 1 to 384 bytes, and all distances from the RDRAM page end from 0 to 128 bytes. It also contains timing information on all those transfers. The repo can be used as a test suite for emulators, but also to further investigate other corner cases.