Understanding the Hardware Interface
The SNES contains two internal address buses: the A-Bus and the B-Bus. The A-Bus is a 24-bit bus that connects the cartridge, CPU, and WRAM. The B-Bus is an 8-bit bus that connects the cartridge, CPU, WRAM, S-PPU, and audio CPU. The B-Bus is exposed via the expansion connector on the bottom of the SNES, and it is the bus that peripherals interface with. Peripherals memory-map their control registers to addresses on the B-Bus. To interact with a peripheral, the CPU simply reads from or writes to the appropriate addresses on the B-Bus.
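As a rough illustration of memory-mapped registers on an 8-bit address space, here is a toy Python model. The class names, the register addresses, and the use of None for an unclaimed read are all assumptions made for this sketch; it does not model real SNES open-bus behavior or timing.

```python
class BBus:
    """Toy model of an 8-bit address space shared by memory-mapped peripherals."""

    def __init__(self):
        self._owners = {}  # B-Bus address -> peripheral that claimed it

    def attach(self, peripheral, addresses):
        """Let a peripheral claim a set of B-Bus addresses for its registers."""
        for addr in addresses:
            if not 0 <= addr <= 0xFF:
                raise ValueError(f"B-Bus addresses are 8 bits wide, got {addr:#x}")
            if addr in self._owners:
                raise ValueError(f"address {addr:#04x} is already claimed")
            self._owners[addr] = peripheral

    def write(self, addr, value):
        """CPU write: forwarded only to the peripheral that owns the address."""
        owner = self._owners.get(addr)
        if owner is not None:
            owner.write(addr, value & 0xFF)

    def read(self, addr):
        """CPU read: None stands in for 'no device drove the bus' (not modeled)."""
        owner = self._owners.get(addr)
        return owner.read(addr) if owner is not None else None


class ScratchPeripheral:
    """Hypothetical peripheral whose registers store the last byte written."""

    def __init__(self):
        self.regs = {}

    def write(self, addr, value):
        self.regs[addr] = value

    def read(self, addr):
        return self.regs.get(addr, 0x00)


bus = BBus()
bus.attach(ScratchPeripheral(), range(0x88, 0x8A))  # illustrative addresses only
bus.write(0x88, 0x5A)
print(hex(bus.read(0x88)))  # 0x5a
print(bus.read(0x10))       # None
```

The point of the model is only that the CPU never addresses the peripheral directly: it touches an address, and whichever device claimed that address responds.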
The master clock of the NTSC SNES runs at ~21.477 MHz, while the PAL SNES runs at ~21.281 MHz. A read or write on the B-Bus to a peripheral takes 6 master cycles, or ~465 ns for NTSC and ~470 ns for PAL. Because the NTSC time interval is shorter, BeagleSatella uses it when determining its timing deadlines. In reality, the SNES CPU only gives the peripheral 2 master cycles (~155 ns) to respond to a request. The SNES CPU uses 2 master cycles to set up the request, 2 more for the peripheral to respond (our 155 ns window), and the last 2 cycles to collect data from the bus and do something useful with it. Such a short interval doesn't give a peripheral much time to respond to a read/write request.
To meet this timing constraint, a peripheral must accomplish each of these tasks within 155 ns:
That is a lot of things to accomplish in only 155 ns, particularly for a read operation that has a strict deadline for putting the appropriate data onto the data bus.
The SNES provides two control signals for address bus activity: /PARD and /PAWR. These are active-low signals that tell any devices on the bus when a CPU read or write request is active. In the image to the left, the yellow top signal is an active /PARD captured on an oscilloscope. The width of this active pulse is ~140 ns, which is even less than the calculated ~155 ns. This brings up another point: theoretical timing calculations are all well and good, but empirical measurements of the actual physical system with logic analyzers and scopes tell the true story.
BeagleSatella uses a CPLD (a socketed 44-pin PLCC Atmel ATF1504AS-10) to monitor /PAWR, /PARD, and the address currently on the B-Bus. The CPLD acts as a filter that only lets BeagleSatella-related bus activity through to the BeagleBone Black (BBB). If the address on the B-Bus belongs to one of BeagleSatella's memory-mapped I/O (MMIO) registers, then /PAWR, /PARD, and the address on the B-Bus are passed through the CPLD and a 5V-to-3.3V line-level converter to the BBB's two programmable real-time units (PRUs). Due to limitations on the number of BBB PRU input GPIOs available, the CPLD uses a lookup table to reduce the 8-bit B-Bus address space to a 6-bit address bus (still allowing for a potential 64 bytes of MMIO registers in a BeagleSatella peripheral). The CPLD also inverts the /PAWR and /PARD signals when passing them through to the PRUs. The inversion takes advantage of an optimization within the PRU instructions to save a few cycles.
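The filtering behavior described above can be sketched as a small Python model. This is not the CUPL source: the table contents, the function name, and the choice to output a filtered address of 0 on a miss are assumptions made for illustration.

```python
# Hypothetical register map: 8-bit B-Bus address -> 6-bit filtered address.
# The real table lives in the CPLD's CUPL source; these entries are placeholders.
ADDRESS_LUT = {0x88 + i: i for i in range(24)}

# Every filtered address must fit on the 6 lines routed to the PRUs.
if any(not 0 <= index < 64 for index in ADDRESS_LUT.values()):
    raise ValueError("filtered addresses must fit in 6 bits")


def cpld_filter(addr, pard_n, pawr_n):
    """Model the CPLD outputs for one bus state.

    addr   -- 8-bit B-Bus address
    pard_n -- level of active-low /PARD (0 means a read strobe is asserted)
    pawr_n -- level of active-low /PAWR (0 means a write strobe is asserted)

    Returns (pard, pawr, filtered_addr), with the strobes inverted to
    active-high as the PRUs see them.
    """
    index = ADDRESS_LUT.get(addr & 0xFF)
    if index is None:
        # Not one of our registers: keep both strobes inactive so the PRUs
        # never wake up for unrelated bus traffic.
        return (0, 0, 0)
    return (1 - pard_n, 1 - pawr_n, index)


print(cpld_filter(0x88, 0, 1))  # (1, 0, 0)  read of the first register
print(cpld_filter(0x8A, 1, 0))  # (0, 1, 2)  write to the third register
print(cpld_filter(0x21, 0, 1))  # (0, 0, 0)  unrelated address, filtered out
```

Doing the address match in the CPLD means the PRUs only ever see strobes for addresses they must service, so no PRU cycles are spent rejecting unrelated traffic.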
This CPLD filtering mechanism works quite well, since the CPLD is able to filter the address and pass it along through the line-level converter with a latency of only 25 ns. Out of the measured ~140 ns /PARD pulse, this leaves 115 ns for BeagleSatella to service the read/write request. You can review the CPLD CUPL source for implementation details and a visualization of the CPLD pin mapping in the generated "fit" file.
Understanding the BeagleSatella's PRU Firmware
The BBB's two PRUs (PRU0 and PRU1) are a pair of asymmetrical coprocessors that are clocked at 200 MHz. Firmware running on a PRU runs bare-metal, and most instructions (including sampling the PRU-enhanced GPIOs) can be executed in a single 5 ns cycle. Unfortunately, the number of PRU-enhanced GPIOs available for external interfacing via pin multiplexing is very limited. Even worse, these pins aren't bi-directional. Worse still, these enhanced GPIOs are tied to specific PRUs. BeagleSatella uses all 24 available enhanced GPIOs to accomplish its interfacing:
The diagram above shows the pins used on the BBB to handle BeagleSatella signals. Any signals marked in blue are outputs leaving the BBB, and any signals marked in green are inputs entering the BBB. Any unmarked pins are assumed to have the default pin multiplexing of the BBB. GND, 3.3V VCC, and 5.0V VCC are also all used, though those pins are always available and cannot be changed via multiplexing. One important thing to note is that two signals are mapped to both P9.41 and P9.42. In order to use these pins, the PRU-friendly signals (internally called "P9_91" and "P9_92" for P9.41 and P9.42, respectively) are mapped to mode 5 per the table above. The other signals (internally called "P9_41" and "P9_42") are mapped as mode 7 (GPIO) inputs to avoid having them output signals on the same pins. For the ground truth on exactly how everything is muxed, review the BeagleSatella device tree overlay.
Once the signals have reached the BBB's PRUs, the PRUs have to act quickly to meet the deadline. PRU0 constantly spins on the passed-through (and inverted) PARD and PAWR signals. Once either PARD or PAWR goes high, PRU0 signals PRU1 via an interrupt to have PRU1 sample the filtered address lines and B-Bus data. PRU1 samples these lines and shoves the data into a register shared between PRU0 and PRU1. It then sends an interrupt to PRU0 to notify it that the data is available.
At this point, PRU0 jumps into either a read or write vector table, depending upon whether the current access is a read or write. The filtered register address provides the offset within the table to jump to. PRU0 gets the address of a peripheral register-specific handler from the vector table and jumps into it to handle the request. For writes, this is very simple: store the B-Bus data into a location associated with the register. For reads, PRU0 must quickly access the appropriate data and then place it onto the B-Bus via the appropriate output GPIOs.
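The vector-table dispatch can be modeled in a few lines of Python. The real firmware is PRU assembly working with jump addresses; the handler names, the register contents, and the special-case register at index 3 below are hypothetical and exist only to show the shape of the technique.

```python
NUM_REGS = 64  # a 6-bit filtered address selects one of 64 slots

registers = [0x00] * NUM_REGS  # backing store for the hypothetical registers


def default_write(index, data):
    """Simple write handler: store the byte in the slot for this register."""
    registers[index] = data & 0xFF


def default_read(index):
    """Simple read handler: return the byte to drive onto the B-Bus."""
    return registers[index]


def status_read(index):
    """Hypothetical special-case handler: a register that always reads 0x80."""
    return 0x80


# One entry per filtered address; most registers share the default handlers.
write_vectors = [default_write] * NUM_REGS
read_vectors = [default_read] * NUM_REGS
read_vectors[3] = status_read


def service_access(is_read, index, data=None):
    """Dispatch one bus access; returns the byte to output on a read, else None."""
    if is_read:
        return read_vectors[index](index)
    write_vectors[index](index, data)
    return None


service_access(False, 5, 0x42)       # SNES writes 0x42 to register 5
print(hex(service_access(True, 5)))  # 0x42
print(hex(service_access(True, 3)))  # 0x80
```

The appeal of a table is that dispatch cost is the same for every register: one indexed lookup and one jump, with no chain of comparisons whose length depends on the address.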
The output GPIOs pass through a tri-stated 3.3V-to-5V line-level converter to the SNES's data bus. The converter only places data on the SNES's data bus when the filtered PARD signal is active, so BeagleSatella is in a high-impedance state and tri-stated off of the B-Bus when no BeagleSatella read is active.
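To show why the tri-state behavior matters on a shared bus, here is a minimal Python model. The function names are made up for this sketch, and None is used to represent high impedance; it says nothing about the electrical details of the actual converter.

```python
def bus_driver(pard_active, output_byte):
    """Return what the converter presents to the SNES data bus.

    None represents high impedance (the converter is not driving the bus).
    """
    return (output_byte & 0xFF) if pard_active else None


def resolve_bus(*drivers):
    """Combine all devices on a shared bus; two active drivers is contention."""
    active = [value for value in drivers if value is not None]
    if len(active) > 1:
        raise RuntimeError("bus contention: multiple devices driving the data bus")
    return active[0] if active else None


# No BeagleSatella read in progress: another device's value appears unchanged.
print(resolve_bus(bus_driver(False, 0x42), 0x99))  # 153
# BeagleSatella read in progress and nothing else driving: our byte appears.
print(resolve_bus(bus_driver(True, 0x42), None))   # 66
```

Gating the output on the filtered PARD signal guarantees the first case: whenever the access is not a BeagleSatella read, the peripheral contributes nothing to the bus.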
It typically takes 80-100 ns to perform the coordination between the PRUs, sample the GPIOs, perform the register operation, and output data through the latency of the line-level converter (on a read). The important part is that BeagleSatella is able to read and write on the SNES data bus within the deadlines required to interface with real SNES hardware.
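The figures quoted in this section can be put side by side with a little arithmetic. The sketch below uses only numbers stated above (the ~140 ns measured pulse, the 25 ns CPLD latency, the 200 MHz PRU clock, and the 80-100 ns service time); the variable and function names are just for illustration.

```python
PRU_CLOCK_HZ = 200_000_000
PRU_CYCLE_NS = 1_000_000_000 // PRU_CLOCK_HZ  # 5 ns per PRU cycle

PARD_PULSE_NS = 140   # measured width of the active /PARD pulse
CPLD_LATENCY_NS = 25  # CPLD filtering plus line-level converter


def ns_to_pru_cycles(ns):
    """Whole PRU cycles that fit into the given number of nanoseconds."""
    return ns // PRU_CYCLE_NS


budget_ns = PARD_PULSE_NS - CPLD_LATENCY_NS
print(f"budget: {budget_ns} ns = {ns_to_pru_cycles(budget_ns)} PRU cycles")

for service_ns in (80, 100):
    slack_ns = budget_ns - service_ns
    print(f"service {service_ns} ns: slack {slack_ns} ns "
          f"= {ns_to_pru_cycles(slack_ns)} PRU cycles")
```

This prints a budget of 115 ns (23 PRU cycles) and a slack of 35 ns (7 cycles) in the best case and 15 ns (3 cycles) in the worst case, which shows how little headroom is left for extra work inside a handler.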
Understanding BeagleSatella's Kernel Driver
So that handles the hardware interfacing. But does the peripheral do anything useful? After all, if you aren't giving the SNES useful data, what was the point of all of this? Unlike a desktop PC running an emulator, we can't play fast and loose with the timing of our peripheral. If we're doing any processing, calculations, data collection/storage, or hardware interfacing, we need to do it in the time periods between servicing B-Bus read/write requests.