On the PS2 there was a very small memory area, called the scratchpad, that was very quick to access. The rough idea was to DMA data in and out of the scratchpad and then do work on the data there, without creating contention with everything else going on at the same time.
In general most developers struggled to do much with it; it was just too small, and fiddly to use on top of that.
PS2 programmers were very used to thinking in this way, as it's how the rendering had to be done. There are a couple of vector units, and one of them is connected to the GPU, so the general structure most developers followed was to have 4 buffers in the VU memory (I think it only had 16 KB of memory or something pretty small). Essentially, in parallel you'd have:
1. New data being DMAd in from main memory to VU memory (into say buffer 1/4).
2. Previous data in buffer 3/4 being transformed, lit, coloured, etc and output into buffer 4/4.
3. Data from buffer 2/4 being sent/rendered by the GPU.
Then once the above had finished it would flip, so you'd alternate like:
Data in: B1 (main memory to VU)
Data out: B2 (VU to GPU)
Data process from: B3 (VU processing)
Data process to: B4 (VU processing)

Data in: B3
Data out: B4
Data process from: B1
Data process to: B2
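
The flip above can be sketched as a small role rotation. This is only a model of the bookkeeping, not real PS2 code: the `BufferRoles` struct and the function names are made up for illustration, and the buffers are just the numbers 1 to 4.

```c
#include <assert.h>

/* Hypothetical model of the four-buffer scheme: each role holds the
   number (1-4) of the buffer currently serving it. */
typedef struct {
    int dma_in;        /* main memory to VU memory */
    int gpu_out;       /* VU memory to GPU */
    int process_from;  /* VU reads untransformed data here */
    int process_to;    /* VU writes transformed data here */
} BufferRoles;

/* Starting assignment from the text: in B1, out B2, process B3 to B4. */
static BufferRoles roles_initial(void) {
    BufferRoles r = { 1, 2, 3, 4 };
    return r;
}

/* Flip once all three activities have finished. The freshly loaded
   buffer becomes the processing input, the freshly written output goes
   to the GPU, and the two drained buffers are reused. */
static BufferRoles roles_flip(BufferRoles r) {
    BufferRoles next;
    assert(r.dma_in != r.process_from && r.gpu_out != r.process_to);
    next.process_from = r.dma_in;
    next.process_to   = r.gpu_out;
    next.dma_in       = r.process_from;
    next.gpu_out      = r.process_to;
    return next;
}
```

Calling `roles_flip` on the initial state gives the second block of assignments above, and calling it again returns to the first.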
The VU has two pipelines running in parallel (float and integer), and every instruction takes an exact number of cycles to process. If you read a result before it is ready you stall the pipeline, so you had to painstakingly interleave and order your instructions to process three verts at a time, and be very clever about register pressure etc.
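
To show why interleaving matters, here is a toy cycle counter. It is a sketch under loud assumptions: a single in-order pipeline that issues one instruction per cycle, a fixed result latency of 4 cycles, and an invented `Instr` format. It is not a model of the real VU instruction set.

```c
#include <assert.h>

#define NUM_REGS 32
#define RESULT_LATENCY 4  /* assumed latency, for illustration only */

/* Hypothetical instruction: writes dst, reads src0 and src1 (-1 = unused). */
typedef struct {
    int dst;
    int src0;
    int src1;
} Instr;

/* Returns the cycle count needed to issue the whole program, stalling
   whenever a source register's result is not ready yet. */
static int count_cycles(const Instr *prog, int n) {
    int ready[NUM_REGS] = {0};  /* cycle at which each register is usable */
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        int issue = cycle;
        int s0 = prog[i].src0, s1 = prog[i].src1, d = prog[i].dst;
        assert(s0 < NUM_REGS && s1 < NUM_REGS && d < NUM_REGS);
        if (s0 >= 0 && ready[s0] > issue) issue = ready[s0];  /* stall */
        if (s1 >= 0 && ready[s1] > issue) issue = ready[s1];  /* stall */
        if (d >= 0) ready[d] = issue + RESULT_LATENCY;
        cycle = issue + 1;
    }
    return cycle;
}
```

In this toy model, three verts that each need a chain of three dependent instructions take 27 cycles when processed one after another, and 11 cycles when the same nine instructions are interleaved across the three verts.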
There is obviously some clever syncing logic to allow all of this to work, allowing the DMA to wait until the VU kicks off the next GPU batch etc.
It was complex to get your head around, to set up all the moving parts, and to debug when it went wrong. When it went wrong it pretty much just hung, so you had to write a lot of validators. On PS2 you basically spend the frame building up a huge DMA list, then at the end of the frame you kick it off and it renders everything: the DMA will transfer VU programs to the VU, upload data to the VU, wait for it to process and upload the next batch, at the end upload the next program, upload settings to GPU registers, basically everything. Once that DMA is kicked off no more CPU code is involved in rendering the frame, so you have a MB or so of pure memory transfer instructions firing off, and if any of them are wrong you are in a world of pain.
Then throw in, just to keep things interesting, the fact that anything you write to memory is likely still sitting in the caches, and DMA doesn't see the caches, so extra care has to be taken to make sure caches are flushed before using DMA.
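
A validator of the kind mentioned above might look roughly like this. The `TransferEntry` layout is invented for illustration and is not the real DMA tag format; the rules checked (16-byte alignment, non-empty transfers, staying inside 32 MB of main memory) are assumptions about what such a check would cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAIN_RAM_BYTES (32u * 1024u * 1024u)  /* PS2 main memory size */
#define QWORD_BYTES 16u                       /* transfers counted in 16-byte units */

/* Hypothetical, simplified list entry: NOT the hardware tag layout. */
typedef struct {
    uint32_t src_addr;     /* byte address in main memory */
    uint32_t qword_count;  /* number of 16-byte units to transfer */
} TransferEntry;

typedef enum {
    ENTRY_OK = 0,
    ENTRY_MISALIGNED,
    ENTRY_EMPTY,
    ENTRY_OUT_OF_RANGE
} EntryStatus;

/* Checks one entry against the assumed rules. */
static EntryStatus validate_entry(const TransferEntry *e) {
    assert(e != NULL);
    if (e->src_addr % QWORD_BYTES != 0) return ENTRY_MISALIGNED;
    if (e->qword_count == 0) return ENTRY_EMPTY;
    /* 64-bit arithmetic so a huge count cannot wrap around. */
    uint64_t end = (uint64_t)e->src_addr + (uint64_t)e->qword_count * QWORD_BYTES;
    if (end > MAIN_RAM_BYTES) return ENTRY_OUT_OF_RANGE;
    return ENTRY_OK;
}

/* Returns the index of the first bad entry, or -1 if the list is clean. */
static long first_bad_entry(const TransferEntry *list, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (validate_entry(&list[i]) != ENTRY_OK) return (long)i;
    }
    return -1;
}
```

Running something like `first_bad_entry` over the whole list before kicking it off turns a silent hang into an index you can look at in a debugger.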
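
The cache problem can be shown with a toy simulation. Everything here is a made-up model (`ToySystem`, `cpu_write`, `dma_read`, `flush_cache` are illustrative names, not PS2 SDK calls): CPU writes land in a write-back cache, while the DMA reads main memory directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 16

/* Hypothetical system: main memory plus a write-back cache in front of it. */
typedef struct {
    uint32_t memory[MEM_WORDS];  /* what the DMA sees */
    uint32_t cached[MEM_WORDS];  /* what the CPU last wrote */
    bool     dirty[MEM_WORDS];   /* cached value not yet written back */
} ToySystem;

static void toy_init(ToySystem *s) {
    memset(s, 0, sizeof *s);
}

/* CPU store: updates only the cache and marks the word dirty. */
static void cpu_write(ToySystem *s, int addr, uint32_t value) {
    assert(addr >= 0 && addr < MEM_WORDS);
    s->cached[addr] = value;
    s->dirty[addr] = true;
}

/* DMA read: bypasses the cache and reads main memory directly. */
static uint32_t dma_read(const ToySystem *s, int addr) {
    assert(addr >= 0 && addr < MEM_WORDS);
    return s->memory[addr];
}

/* Write every dirty word back to main memory. */
static void flush_cache(ToySystem *s) {
    for (int i = 0; i < MEM_WORDS; i++) {
        if (s->dirty[i]) {
            s->memory[i] = s->cached[i];
            s->dirty[i] = false;
        }
    }
}
```

In this model a `dma_read` straight after a `cpu_write` still returns the old value; only after `flush_cache` does the DMA see what the CPU wrote.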
It was a magical, horrible, wonderful, painful, joyous, impossible, satisfying, sickening, amazing time.