8 Integer ALUs, 4 Vector FPUs, 8x L1 d-caches but only 4x L2 d-Caches.
And perhaps most importantly: 4x decoders/4x L1 iCache. IIRC, the entire damn chip was decoder-bound.
--------
Note: AMD Zen has 4x Integer pipelines and 4x FPU pipelines __PER CORE__. Modern high-performance systems CANNOT have a single 2x-pipeline FPU shared between two cores (averaging one pipeline per core). Modern Zen is closer to 4x pipelines per core, maybe more depending on how you count load/store units.
Yup. The limited decoders meant your pipeline just wasn’t flowing every cycle, because many of the stages were sitting idle.
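A back-of-the-envelope sketch of what "decoder-bound" means here. This is a toy model, not a description of the real pipeline: `sustained_ipc_per_core` and its parameters are made-up names, and the backend width of 4 is an assumption picked only to show the front end becoming the limit.

```python
def sustained_ipc_per_core(decode_width, cores_sharing_decoder, backend_width):
    """Upper bound on instructions per cycle for one core in a toy model.

    The front end can only feed each core its share of the decoder, and the
    core cannot execute more than its backend width, so the bound is the
    smaller of the two.
    """
    frontend_share = decode_width / cores_sharing_decoder
    return min(frontend_share, backend_width)


# Assumption: one 4-wide decoder shared by two busy cores (as described above).
shared = sustained_ipc_per_core(decode_width=4, cores_sharing_decoder=2, backend_width=4)

# Same hypothetical core, but with the decoder all to itself.
private = sustained_ipc_per_core(decode_width=4, cores_sharing_decoder=1, backend_width=4)

print(f"shared decoder:  {shared} instructions/cycle/core")
print(f"private decoder: {private} instructions/cycle/core")
```

In this model, extra execution units do nothing once both cores are busy, because the decoder share is the smaller term.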
Note that Intel's modern e-Core has 3x decoders per core. When code is straight, they alternate (decoder#1 / decoder#2 / decoder#3). When code is branchy, they split up across different jumps aka if/else statements.
Shrinking the decoder on Bulldozer was clearly the wrong move for the FX series / AMD. Today's chips are going for a wide decoder (e.g. Apple can do 8x decode per clock tick), a deep opcode cache (AMD Zen has a large opcode cache allowing a 6x-wide lookup per clock tick), or Intel's new and interesting multiple-decoder thing.
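To make the multiple-decoder idea a bit more concrete, here is a rough toy model. Everything in it is an assumption for illustration: `decode_cycles` is a made-up helper, blocks are dealt to clusters round-robin, and a decoder is assumed unable to continue past the end of a block (a taken branch) within the same cycle. Real hardware is far more subtle than this.

```python
from math import ceil


def decode_cycles(block_lengths, clusters=3, width=3):
    """Toy estimate of the cycles needed to decode a list of instruction blocks.

    Each entry in block_lengths is a run of instructions that ends at a taken
    branch. Blocks are dealt to decode clusters round-robin. A cluster decodes
    at most `width` instructions per cycle and never crosses a block boundary
    within a cycle (assumption).
    """
    busy = [0] * clusters
    for i, length in enumerate(block_lengths):
        busy[i % clusters] += ceil(length / width)
    # The clusters work in parallel, so the slowest one sets the total.
    return max(busy)


straight = [36]        # 36 instructions, no taken branches
branchy = [2] * 18     # 36 instructions, a taken branch every 2 instructions

print("one 9-wide decoder, straight code:", decode_cycles(straight, clusters=1, width=9))
print("one 9-wide decoder, branchy code: ", decode_cycles(branchy, clusters=1, width=9))
print("three 3-wide clusters, branchy:   ", decode_cycles(branchy, clusters=3, width=3))
```

Under these assumptions the single wide decoder wastes most of its width on branchy code, while the three narrow clusters each chew on a different branch target. The model does not cover how straight-line code gets split across clusters.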
How do you know the behavior of the decoding portion of Intel's E-cores? Do you work for them?
People use clever code to tease out microarchitectural details and scour through public information to work these things out. Agner Fog is one example. His microarchitecture analysis documents 3x decoders for the Tremont microarch, predecessor to Gracemont (what's currently used for E-cores).
https://www.agner.org/optimize/microarchitecture.pdf
The architecture of Intel cores is widely discussed and publicized. Here are some details for the E-cores mentioned: https://chipsandcheese.com/p/skymont-intels-e-cores-reach-fo...
> Leapfrogging fetch and decode clusters have been a distinguishing feature of Intel’s E-Core line ever since Tremont. Skymont doubles down by adding another decode cluster, for a total of three clusters capable of decoding a total of nine instructions per cycle.
Intel tells you this in their optimization manuals and white papers.
They want you to write code that takes advantage of their speedups. Agner Fog is a better writer (a sibling comment already linked to Agner Fog's stuff). But I also like referencing the official manuals and whitepapers as a primary source document.
Hard to beat Intel's documents on Intel chips after all.
I had a few FX CPUs (and I still have them in storage): the early cheap 4-core parts and the later-generation 8-core parts (FX 8370E).
And I can say that if you run code that scales well across multiple CPUs, it excels at it (I can share an n-problem simulator that I used as a benchmark back in the day).
They even aged far better than some Intel CPUs of the time, because they had 8 cores.
FX cores had their issues. But one of them was that AMD bet too early, and too hard, that the future was a high number of cores.
Problem was that even for multithreaded workloads the "8 core" FX-8150 did not always win against 4 hyperthreaded Intel cores. That is pretty apparent from e.g. the benchmarks here: https://www.phoronix.com/review/intel_corei7_3770k
You can easily see the multithreaded workloads there because you have the six core 3960X as comparison too.
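The "more slow cores vs. fewer fast cores" trade-off argued over in this thread can be roughed out with Amdahl's law. This is only a sketch: the per-core speeds (1.0 vs 1.5) are invented numbers, not measurements of any FX or Core i7 part, hyperthreading is ignored entirely, and `relative_throughput` is a made-up helper.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core when only part of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def relative_throughput(per_core_speed, cores, parallel_fraction):
    """Single-core speed multiplied by the Amdahl speedup for that core count."""
    return per_core_speed * amdahl_speedup(parallel_fraction, cores)


# Hypothetical chips: 8 cores at speed 1.0 vs 4 cores at speed 1.5.
for p in (0.80, 0.90, 0.99):
    many_slow = relative_throughput(1.0, 8, p)
    few_fast = relative_throughput(1.5, 4, p)
    winner = "8 slow cores" if many_slow > few_fast else "4 fast cores"
    print(f"parallel fraction {p:.2f}: {many_slow:.2f} vs {few_fast:.2f} -> {winner}")
```

With these made-up numbers the 8-core chip only pulls ahead once nearly all of the work is parallel, which is consistent with both comments: it can excel on code that scales very well and still lose many "multithreaded" benchmarks.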