Lion Cove: Intel's P-Core Roars

109 points, posted 16 hours ago
by luyu_wu

47 Comments

kristianp

10 hours ago

They measured about 94.9 GB/s of DRAM bandwidth for the Core Ultra 7 258V. Aren't Intel going to respond to the 200 GB/s bandwidth of the M1 Pro, introduced 3 years ago? Not to mention the 400 GB/s of the Max and 800 GB/s of the Ultra?

Most of the bandwidth demand is served by cache hits, but for those rare workloads larger than the caches, Apple's products may be 2-8x faster?

adrian_b

6 hours ago

AMD Strix Halo, to be launched in early 2025, will have a 256-bit memory interface for LPDDR5x of 8 or 8.5 GHz, so it will match M1 Pro.

However, Strix Halo, which has a much bigger GPU, is designed for a maximum power consumption for CPU+GPU of 55 W or more (up to 120 W), while Lunar Lake is designed for 17 W, which explains the choices for the memory interfaces.

kvemkon

an hour ago

> LPDDR5x of 8 or 8.5 GHz

Shouldn't that be 8000 or 8500 MT/s, with the actual frequency being half that?

Dylan16807

5 hours ago

That's good. And better than a match: that's 30% faster, at least until the M4 Pro launches with a RAM frequency upgrade.

On the other hand, I do think it's fair to compare to the Max too, and it loses by a lot to that 512-bit bus.
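
A quick back-of-envelope check of those numbers, as a hedged sketch (the 256-bit bus and the M1 Pro's 200 GB/s are from the comments above; MT/s is the transfer rate, with the actual clock at half that):

  #include <stdio.h>

  int main(void) {
      /* LPDDR5x options mentioned above, in transfers per second */
      double mts[] = { 8000e6, 8500e6 };
      for (int i = 0; i < 2; i++) {
          double gbps = mts[i] * 32 / 1e9;  /* 256-bit bus = 32 B/transfer */
          printf("%4.0f MT/s: clock %.0f MHz, %3.0f GB/s (%+.0f%% vs. M1 Pro)\n",
                 mts[i] / 1e6, mts[i] / 2e6, gbps, (gbps / 200.0 - 1) * 100);
      }
      return 0;
  }

That prints 256 GB/s (+28%) and 272 GB/s (+36%) for the two speed grades, which is where the roughly 30% figure comes from.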

wtallis

8 hours ago

Lunar Lake is very clearly a response to the M1, not its larger siblings: the core counts, packaging, and power delivery changes all line up with the M1 and successors. Lunar Lake isn't intended to scale up to the power (or price) ranges of Apple's Pro/Max chips. So this is definitely not the product where you could expect Intel to start using a wider memory bus.

And there's very little benefit to widening the memory bus past 128-bit unless you have a powerful GPU to make good use of that bandwidth. There are comparatively few consumer workloads for CPUs that are sufficiently bandwidth-hungry.

nox101

8 hours ago

With all of the local ML being introduced by Apple, Google, and Microsoft, this thinking seems close to "640K is all you need".

I suspect consumer workloads will become more bandwidth-hungry.

throwuxiytayq

6 hours ago

I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. People work and store their stuff in the cloud. Most people just want a lightweight laptop, and AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them at all. Production-quality models are pretty much cloud-only, and I don't think open-source models, especially ones viable for local inference, will close the gap anytime soon. I'd like all of those things to be different, but I think that's just the way things are.

Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.

0x000xca0xfe

39 minutes ago

Microsoft wants to bring Recall back. When ML models ship as part of the OS, suddenly there are hundreds of millions of users.

tucnak

3 hours ago

> AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them

M2 Max is passively cooled... and does half of a 4090's token bandwidth in inference.

Onavo

5 hours ago

> I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. [...]

Do you use FTP instead of Dropbox?

epolanski

3 hours ago

The few reviews we have seen so far show that Lunar Lake is competitive with the M3 too, depending on the application.

formerly_proven

3 hours ago

Is the full memory bandwidth actually available to the CPU on M-series CPUs? Because that would seem like a waste of silicon to me, to have 200+ GB/s of past-LLC bandwidth for eight cores or so.

wmf

9 hours ago

The "response" to those is discrete GPUs that have been available all along.

Aaargh20318

3 hours ago

Discrete GPUs are a dead-end street. They are fine for gaming, but for GPGPU tasks unified memory is a game changer.

kristianp

9 hours ago

True, but I thought Intel might start using more channels to make that metric look less unbalanced in Apple's favour. Especially now that they are putting RAM on package.

tjoff

3 hours ago

Why the obsession with this particular metric? And how can one claim something is unbalanced while focusing on a single metric?

sudosysgen

8 hours ago

Not really; the killer is latency, not throughput. It's very rare that a CPU actually runs out of memory bandwidth. Extra bandwidth is much more useful for the GPU.

95 GB/s is 24 GB/s per core; at 4.8 GHz that's 40 bits per core per cycle. You would have to be doing basically nothing useful with the data to be able to get through that much bandwidth.
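
As a sketch, assuming the 24 GB/s figure comes from splitting the 95 GB/s across just the four P-cores:

  #include <stdio.h>

  int main(void) {
      double bw_total = 95e9;  /* measured DRAM bandwidth, bytes/s */
      int    cores    = 4;     /* assumption: the 258V's four P-cores */
      double clock    = 4.8e9; /* 258V max turbo, Hz */

      double per_core = bw_total / cores;
      printf("%.1f GB/s per core, %.0f bits per core per cycle\n",
             per_core / 1e9, per_core / clock * 8);
      return 0;
  }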

fulafel

an hour ago

40 bits per clock in an 8-wide core gets you 5 bits per instruction, and we have AVX512 instructions to feed, with operand sizes 100x that (and there are multiple operands).

Modern chips do face the memory wall. See e.g. here (though it is about Zen 5), where they conclude in the same vein: "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."
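
For illustration, this is roughly how such a bound is derived; the core width, clock, core count, and DRAM bandwidth below are placeholder assumptions, not measured Zen 5 figures:

  #include <stdio.h>

  int main(void) {
      double clock        = 5.0e9; /* Hz per core (assumed) */
      double simd_per_clk = 4.0;   /* AVX512 ops/cycle/core (assumed) */
      double cores        = 16.0;
      double dram_bw      = 90e9;  /* bytes/s, shared by all cores (assumed) */

      double lines_per_s = dram_bw / 64.0;  /* 64 B (512-bit) loads */
      double insts_per_s = clock * simd_per_clk * cores;
      /* Same order of magnitude as the quoted 340 */
      printf("~%.0f AVX512 instructions per 512-bit load from DRAM\n",
             insts_per_s / lines_per_s);
      return 0;
  }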

unsigner

6 hours ago

There might be a chicken-and-egg situation here - one often hears that there's no point having wider SIMD vectors or more ALU units, as they would spend all their time waiting for memory anyway.

jart

3 hours ago

The most important algorithm in the world, matrix multiplication, just does a fused multiply-add on the data. Memory bandwidth is a real bottleneck.

svantana

35 minutes ago

Is it, though? The matmul of two NxN matrices takes N^3 MACs and 2*N^2 memory accesses. So the larger the matrices, the more the arithmetic dominates (with some practical caveats, obviously).
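
A worked version of that ratio, using the 2*N^2 count from above (reading A and B; writing C back would add another N^2):

  #include <stdio.h>

  int main(void) {
      for (long n = 256; n <= 16384; n *= 4) {
          double macs     = (double)n * n * n;  /* N^3 multiply-adds */
          double accesses = 2.0 * n * n;        /* 2*N^2 element reads */
          printf("N=%5ld: %6.0f MACs per memory access\n", n, macs / accesses);
      }
      return 0;
  }

The ratio is N/2, so it climbs from 128 at N=256 to 8192 at N=16384.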

perryh2

13 hours ago

It looks awesome. I am definitely going to purchase a 14" Lunar Lake laptop from either Asus (Zenbook S14) or Lenovo (Yoga Slim). I really like my 14" MBP form factor and these look like they would be great for running Linux.

jjmarr

12 hours ago

I constantly get graphical glitches on my Zenbook Duo 2024. Would recommend against going Intel if you want to use Linux.

skavi

11 hours ago

Intel has historically been pretty great at Linux support. Especially for peripherals like WiFi cards and GPUs.

jauntywundrkind

10 hours ago

Their "PCIe" wifi cards "mysteriously" not working in anything but Intel systems is enraging.

I bought a wifi7 card & tried it in a bunch of non-Intel systems; it straight up didn't work. Bought a wifi6 card and it sort of works, ish, but I have to reload the wifi module and sometimes it just dies. (And no, these are not CNVio parts.)

I think Intel has a great legacy & does super things. Usually their driver support is amazing. But these wifi cards have been utterly enraging & far below what's acceptable in the PC world; they are not fit to be called PCIe devices.

Something about wifi really brings out the worst in companies. :/

transpute

9 hours ago

> not fit to be called PCIe devices

They might be CNVi in M.2 form factor, with the rest of the "wifi card" inside the Intel SoC.

  In CNVi, the network adapter's large and usually expensive functional blocks (MAC components, memory, processor and associated logic/firmware) are moved inside the CPU and chipset (Platform Controller Hub). Only the signal processor, analog and Radio frequency (RF) functions are left on an external upgradeable CRF (Companion RF) module which, as of 2019 comes in M.2 form factor.

Wifi7 has 3-D radar features for gestures, heartbeat, keystrokes and human activity recognition, which requires the NPU inside Intel SoC. The M.2 card is only a subset.

zaptrem

7 hours ago

> Wifi7 has 3-D radar features for gestures, heartbeat, keystrokes and human activity recognition, which requires the NPU inside Intel SoC. The M.2 card is only a subset.

Source? Google turned up nothing.

zxexz

4 hours ago

Sounds like some LLM hallucination to me.

EDIT: Right after that, I found another HN comment [0] by the same user (through a Google search!).

[-1] is an interesting IEEE email thread related to preamble puncturing.

Misc (I have not yet read these through beyond the abstracts): a preprint on arXiv related to the proposed spec [1]; a paper in IEEE Xplore on 802.11bf [2]; a NIST publication on 802.11bf [3] (basically [2], but on NIST).

[-1] https://www.ieee802.org/11/email/stds-802-11-tgbe/msg00711.h...
[0] https://news.ycombinator.com/item?id=38811036
[1] https://arxiv.org/pdf/2207.04859
[2] https://ieeexplore.ieee.org/document/10467185
[3] https://www.nist.gov/publications/ieee-80211bf-enabling-wide...

silisili

11 hours ago

I get them also on my Lunar Lake NUC. Usually in the browser, and it presents as missing/choppy text, oddly enough. Annoying but not really a deal breaker. Hoping it gets sorted out in the next couple of kernel updates.

jjmarr

6 hours ago

Do you get weird checkerboard patterns as well?

gigatexal

11 hours ago

Give it some time. It probably needs updated drivers; Intel and Linux have been rock solid for me too. If your hardware is really new, it's likely a kernel-and-time issue. 6.12 or 6.13 should have everything sorted.

rafaelmn

4 hours ago

Given the layoffs and the downward spiral, I wouldn't hold my breath for this.

amanzi

13 hours ago

I'm really curious about how well they run Linux. E.g. will the NPU work under Linux the same way it does on Windows, or does it require specific drivers? Same with the battery life: is there a Windows-specific driver that helps with this, or can we expect the same under Linux?

adrian_b

6 hours ago

I completely agree with the author that renaming the L1 cache as L0 and introducing a new L1 cache, as Intel has done, is completely misleading terminology.

The correct solution is the one from the parent article: continue to call the L1 cache the L1 cache, because there is no important difference between it and the L1 caches of previous CPUs, and call the new cache that has been inserted between the L1 and L2 the L1.5 cache.

Perhaps Intel did this to give the wrong impression that the new CPUs have a bigger L1 cache than the old CPUs. That impression would be incorrect, because the so-called new L1 cache has much lower throughput and worse latency than a true L1 cache in any other CPU.

The new L1.5 is not a replacement for an L1 cache; it functions as part of the L2 cache, with the same throughput as the L2 but lower latency. As explained in the article, this has been necessary to allow Intel to expand the L2 cache to 2.5 MB in Lunar Lake and to 3 MB in Arrow Lake S (desktop CPU), in comparison with AMD, which has only a 1 MB L2 cache (but a bigger L3 cache).

According to rumors, while the top AMD desktop CPUs without stacked cache have 80 MB of L2+L3 cache (16 MB L2 + 64 MB L3), the top Intel model 285K might have 78 MB of cache, i.e. about the same amount, but distributed differently across levels: 2 MB L1.5 + 40 MB L2 + 36 MB L3. Nevertheless, there is so far no official information from Intel about Arrow Lake S, whose launch is expected a month from now, so the amount of L3 cache is not certain; only the amounts of L2 and L1.5 are known from earlier Intel presentations.

Lunar Lake is an excellent design for all applications where adequate cooling is impossible, i.e. thin and light notebooks and tablets or fanless small computers.

Nevertheless, Intel could not abstain from using unfair marketing tactics. Almost all the benchmarks presented by Intel at the launch of Lunar Lake have been based on the top model 288V. Both top models, 288V and 268V, are likely to be unobtainium for most computer models, and at the few manufacturers that will offer this option they will be extremely overpriced.

Most available and affordable computers with Lunar Lake will not offer any CPU better than the 258V, which is the one tested in the parent article. The 258V has only 4.8 GHz/2.2 GHz turbo/base clock frequencies, vs. 5.1 GHz/3.3 GHz for the 288V used in the Intel benchmarks and in many other online benchmarks. So the actual experience of most Lunar Lake users will not match most published benchmarks, even if it will be good enough in comparison with any competitor in the same low-power market segment.

RicoElectrico

13 hours ago

> A plain memory latency test sees about 131.4 ns of DRAM latency. Creating some artificial bandwidth load drops latency to 112.4 ns.

Can someone put this in context? The values seem an order of magnitude higher than the ones here: https://www.anandtech.com/show/16143/insights-into-ddr5-subt...

toast0

12 hours ago

The Chips and Cheese number feels like an all-in number: get a timestamp, do a memory read (that you know will not be served from cache), get another timestamp.

The AnandTech article lists latencies for parts of a memory operation, between the memory controller and the RAM. End-to-end latency is going to be a lot more than just CAS latency, because CAS latency only applies once you've got the proper row open, etc.
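
A minimal sketch of such an all-in measurement (buffer size and step count are arbitrary choices here; real tools also pin the thread, defeat the prefetchers more carefully, and separate out TLB effects):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (64 * 1024 * 1024 / sizeof(size_t))  /* 64 MiB, far bigger than LLC */

  int main(void) {
      size_t *next = malloc(N * sizeof *next);
      if (!next) return 1;
      /* Sattolo's algorithm: one random cycle through the whole buffer,
         so every load depends on the previous one and misses cache. */
      for (size_t i = 0; i < N; i++) next[i] = i;
      for (size_t i = N - 1; i > 0; i--) {
          size_t j = rand() % i;
          size_t t = next[i]; next[i] = next[j]; next[j] = t;
      }
      size_t p = 0, steps = 10 * 1000 * 1000;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t s = 0; s < steps; s++) p = next[p];  /* dependent loads */
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf("~%.1f ns per load (p=%zu)\n", ns / steps, p);  /* printing p defeats dead-code elimination */
      free(next);
      return 0;
  }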

wtallis

11 hours ago

Getting requests up through the cache hierarchy to the DRAM controller, and data back down to the requesting core's load/store units is also a non-trivial part of this total latency.

jart

3 hours ago

Use Intel's mlc (Memory Latency Checker) tool to measure your system. On a GCE instance I see about 97 ns for RAM access. On a highly overclocked gaming computer with a small amount of RAM I see 60 ns. Under load, latency usually rises to about 200 ns. On a workstation with a lot of RAM and cores I see it go up to a microsecond.

foota

13 hours ago

I think the numbers in that article (the CAS latency) are latencies "within" the DRAM module itself, not the end-to-end latency between the processor and the RAM.

You could read the article on the latest AMD top of the line desktop chip to compare: https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... (although that's a desktop chip, the original article compares the Intel performance to 128 ns of DRAM latency for AMD's mobile platform Strix Point)

Tuna-Fish

an hour ago

CAS latency is only the latency of doing an access from an open row. This is in no way representative of a normal random access latency. (Because caches are so large that if you were frequently hitting open rows, you'd just load from cache instead.)

The way CAS has been widely understood as "memory latency" is just wrong.
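
To put rough numbers on that (DDR5-6000 with 30-38-38 timings is an assumed example, not from the article): the command clock is half the transfer rate, and even a full row miss, paying precharge + activate + CAS, accounts for only a fraction of the ~100+ ns end-to-end figures discussed above.

  #include <stdio.h>

  int main(void) {
      double clk = 6000e6 / 2;           /* DDR5-6000: 3 GHz command clock */
      int cl = 30, trcd = 38, trp = 38;  /* assumed timings, in clock cycles */

      printf("CAS alone:          %.1f ns\n", cl / clk * 1e9);
      printf("tRP + tRCD + CAS:   %.1f ns\n", (trp + trcd + cl) / clk * 1e9);
      return 0;
  }

That prints 10.0 ns for CAS alone and about 35 ns for the full row miss.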

AzzyHN

11 hours ago

We'll have to see how this compares to Zen 5 once 24H2 drops.

And once more than like three Zen 5 laptops come out.

deaddodo

11 hours ago

The last couple of generations have had plenty of AMD options. Razer 14, Zephyrus G14, TUFbook, etc. If you get out of the performance/enthusiast segment, they're even more plentiful (Inspirons, Lenovos, Zenbooks, etc).

nahnahno

11 hours ago

The review guide had everyone on 24H2; there were some issues with one of the updates that messed up performance for Lunar Lake pre-release, but they appear to have been fixed in time for release.

I'd expect Lunar Lake's position to improve a bit in the coming months as they tweak scheduling, but AMD should already be in good shape at this point.

Edit: around the 16-minute mark: https://youtu.be/5OGogMfH5pU?si=ILhVwWFEJlcA3HLO. The laptops came with 24H2.