How much memory do you need in 2024 to run 1M concurrent tasks?

243 points, posted 15 hours ago
by neonsunset

182 Comments

AkshitGarg

13 hours ago

I feel this benchmark compares apples to oranges in some cases.

For example, for node, the author puts a million promises into the runtime event loop and uses `Promise.all` to wait for them all.

This is very different from, say, the Go version where the author creates a million goroutines and puts `waitgroup.Done` as a defer call.

While this might be the idiomatic way of concurrency in the respective languages, it does not account for how goroutines are fundamentally different from promises, and how the runtime does things differently. For JS, there's a single event loop. Counting the JS execution threads, the event loop thread and whatever else the runtime uses for async I/O, the execution model is fundamentally different from Go. Go (if not using `GOMAXPROCS`) spawns an OS thread for every physical thread that your machine has, and then uses a userspace scheduler to distribute goroutines to those threads. It may spawn more OS threads to account for OS threads sleeping on syscalls. Although I don't think the runtime will spawn extra threads in this case.
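To make the scheduler settings mentioned above concrete, here is a minimal sketch that only queries what the runtime reports. The helper name `schedulerInfo` is made up for illustration; `runtime.NumCPU` and `runtime.GOMAXPROCS` are standard library calls, and passing 0 to `GOMAXPROCS` reads the current value without changing it.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo (hypothetical helper) reports the number of logical CPUs and
// the current GOMAXPROCS value. GOMAXPROCS(0) only queries the setting.
func schedulerInfo() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerInfo()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d, goroutines: %d\n",
		cpus, procs, runtime.NumGoroutine())
}
```

GOMAXPROCS caps how many OS threads execute Go code at the same time; threads blocked in syscalls do not count against it, which is why the runtime may create more threads than that number.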

It also depends on what the "concurrent tasks" (I know, concurrency != parallelism) are. Tasks such as reading a file or doing a network call are better done with something like promises, but CPU-bound tasks are better done with goroutines or Node worker_threads. It would be interesting to see how the memory usage changes when doing async I/O vs CPU-bound tasks concurrently in different languages.

n2d4

12 hours ago

Actually, I think this benchmark did the right thing, one that I wish more benchmarks would do. I'm much less interested in what the differences between compilers are than in what the actual output will be if I ask a professional Go or Node.js dev to solve the same task. (TBF, it would've been better if the task benchmarked was something useful, e.g. handling an HTTP request.)

Go heavily encourages a certain kind of programming; JavaScript heavily encourages a different kind; and the article does a great job at showing what the consequences are.

rtpg

10 hours ago

But you wouldn't call a million tasks with `Promise.all` in Node, right? That's just not a thing that one does.

Instead, there's usually going to be some queue outside the VM that will leave you with _some_ sort of chunking and otherwise working in smaller, more manageable bits (that might, incidentally, be shaped in ways that the VM can handle in interesting ways).

It's definitely true to say that the "idiomatic" way of handling things is worth going into, but if part of your synthetic benchmark involves doing something quite out of the ordinary, it feels suspicious.

I generally agree that a "real" benchmark here would be nice. It would be interesting if someone could come up with the "minimum viable non-trivial business logic" that people could use for these benchmarks (perhaps coupled with automation tooling to run the benchmarks)

hamandcheese

10 hours ago

> But you wouldn't call a million tasks with `Promise.all` in Node, right? That's just not a thing that one does.

But neither would you wait on a waitgroup of size 1 million in Go... right?

ricardobeat

9 hours ago

You could, and the tasks would run concurrently. Node is single threaded, so unless you used one of the I/O calls backed by a thread pool, they would all execute sequentially.

rfoo

8 hours ago

For a goroutine doing nothing, no.

But if I have 1 million tasks which spent 10% of their time on CPU-bound codes, intermixed with other IO-bound codes, and I just want throughput and I'm too lazy to use a proper task queue, then why not?

rtpg

10 hours ago

yeah, right? I mean I don't have a dog in this race, just wished we could get into "normal" repros without having to wonder if some magic is kicking in

Quothling

12 hours ago

> Go heavily encourages a certain kind of programming;

True, but it really doesn't encourage you to run 1m goroutines with the standard memory setting. Though it's probably fair to run Go wastefully when you're comparing it to `Promise.all`.

n2d4

12 hours ago

Of course! That's why the article is telling you that some languages (C#, Rust) are better at it than others (Go, Java). Doesn't mean that Go and Java are bad languages! Just that they aren't good to do this thing.

Quothling

11 hours ago

The article is telling us that you can run really inefficient code. Goroutines should be run with worker pools and a buffered channel and it's silly to not do that and then compare it to things like an optimized Rust crate like Tokio.
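For readers who have not seen the pattern, a worker pool fed by a buffered channel might look roughly like the sketch below. The function name `runPool`, the buffer size, and the worker count are illustrative assumptions, not anything taken from the article's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runPool (hypothetical helper) processes `tasks` units of work with a fixed
// number of worker goroutines and returns how many tasks completed.
func runPool(tasks, workers int, work func(id int)) int {
	jobs := make(chan int, workers*2) // buffered so the producer rarely blocks
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs { // each worker pulls jobs until the channel closes
				work(id)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}

	for i := 0; i < tasks; i++ {
		jobs <- i
	}
	close(jobs) // lets the workers leave their range loops
	wg.Wait()
	return done
}

func main() {
	n := runPool(1000, 8, func(int) { time.Sleep(time.Millisecond) })
	fmt.Println("completed:", n)
}
```

Note that this changes what is being measured: only `workers` tasks are in flight at any moment, so a pool is not a drop-in replacement for a million tasks that all sleep at the same time.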

Aeolun

11 hours ago

Is that the ideomatic way to do it, or the best way you can imagine?

Quothling

11 hours ago

> Is that the ideomatic way to do it

Well... I'm actually not sure what ideomatic means (English isn't my first language), but it's the standard way of doing it. You'll even find it as step 2 and 3 here: https://go.dev/tour/concurrency/1

> or the best way you can imagine

I would do a lot more to tune it if you were in a position where you'd know it would run that many "tasks". I think what many non-Go programmers might run into here is that Go doesn't come with any sort of "magic". Instead it comes with a highly opinionated way of doing things. Compare that to C#, which comes with a highly optimized CLR and a bunch of really excellent libraries which are continuously optimized by Microsoft, and you're going to end up with an article like this. The async libraries are maintaining which tasks are running (though `Promise.all` is obviously also binding a huge amount of memory you don't have to), while the Go example is running 1 million at once.

You'll also notice that there is no benchmark for execution time. With Go you might actually want to pay with memory, though I'd argue that you'd almost never want to run 1 million Goroutines at once.

Though to be fair to this specific author, it looks like they copied the previous benchmarks and then ran it as-is.

guitarbill

11 hours ago

The post was edited, previously it just said roughly this part: "step 2 and 3 here: https://go.dev/tour/concurrency/1". Which - as far as I can tell - does not mention worker pools...

Quothling

10 hours ago

You're right. It is using channels and buffers, but you're right.

It's not part of the actual documentation either, at least not exactly: https://go.dev/doc/effective_go#concurrency You will achieve much the same if you follow it, but my answer should have been yes and no as far as being the "standard" Go way.

pbhjpbhj

10 hours ago

Idiomatic is the word the parent was looking for. The base word is idiom.

It was probably the intent of the parent to mean 'making use of the particular features of the language that are not necessarily common to other languages'.

I'm not a programmer, but you appear to give good examples.

I hope I'm not teaching you to suck eggs... {That's an idiom, meaning teaching someone something they're already expert in. Like teaching your Grandma to suck eggs - which weirdly means blowing out the insides of a raw egg. That's done when using the egg to paint; which is a traditional Easter craft.}

Quothling

10 hours ago

I actually did find "idiomatic" when I looked it up, but I honestly still didn't quite grasp it from the Cambridge Dictionary. Thanks for explaining it in a way I understand.

jchw

11 hours ago

I'm torn.

As far as practicality goes I actually agree with you: if I knew I were trying to do something to the order of 1,000,000 tasks in Go I would probably use a worker pool for this exact reason. I have done this pattern in Go. It is certainly not unidiomatic.

However, it also isn't the obvious way to do 1,000,000 things concurrently in Go. The obvious way to do 1,000,000 things concurrently in Go is to do a for loop and launch a Goroutine for each thing. It is the native unit of task. It is very tightly tied to how I/O works in Go.

If you are trying to do something like a web server, then the calculus changes a lot. In Go, due to the way I/O works, you really can't do much but have a goroutine or two per connection. However, on the other hand, the overhead that goroutines imply starts to look a lot smaller once you put real workloads on each of the millions of tasks.

This benchmark really does tell you something about the performance and overhead of the Go programming language, but it won't necessarily translate to production workloads the way that it seems like it will. In real workloads where the tasks themselves are usually a lot heavier than the constant cost per task, I actually suspect other issues with Go are likely to crop up first (especially in performance critical contexts, latency.) So realistically, it would probably be a bad idea to extrapolate from a benchmark this synthetic to try to determine anything about real world workloads.

Ultimately though, for whatever purpose a synthetic benchmark like this does serve, I think they did the correct thing. I guess I just wonder exactly what the point of it is. Like, the optimized Rust example uses around 0.12 KiB per task. That's extremely cool, but where in the real world are you going to find tasks where the actual state doesn't completely eclipse that metric? Meanwhile, Go is using around 2.64 KiB per task. 22x larger than Rust as it may be, it's still not very much. I think for most real world cases, you would struggle to find too many tasks where the working set per task is actually that small. Of course, if you do, then I'd reckon optimized async Rust will be a true barn-burner at the task, and a lot of those cases where every byte and millisecond counts, Go does often lose. There are many examples.[1]

In many cases Go is far from optimal: Channels, goroutines, the regex engine, various codec implementations in the standard library, etc. are all far from the most optimal implementation you could imagine. However, I feel like they usually do a good job making the performance very sufficient for a wide range of real world tasks. They have made some tradeoffs that a lot of us find very practical and sensible and it makes Go feel like a language you can usually depend on. I think this is especially true in a world where it was already fine when you can run huge websites on Python + Django and other stacks that are relatively much less efficient in memory and CPU usage than Go.

I'll tell you what this benchmark tells me really though: C# is seriously impressive.

[1]: https://discord.com/blog/why-discord-is-switching-from-go-to...

Quothling

10 hours ago

I agree with everything you said, and I think you contributed a lot to what I said, making things much clearer.

> I'll tell you what this benchmark tells me really though: C# is seriously impressive.

The C# team has done some really great work in recent years. I personally hate working with it and its "magic", but it's certainly in a very good place as far as trusting the CLR to "just work".

Hilariously I also found the Python benchmark to be rather impressive. I was expecting much worse. Not knowing Python well enough, however, makes it hard to really "trust" the benchmark. A talented Python team might be capable of reducing memory usage as much as following every step of the Go concurrency tour would for Go.

neonsunset

10 hours ago

Userspace scheduling of Goroutines, virtual stack and non-deterministic pointer type allocation in Go are as much magic if not more, the syntactic sugar of C# is there to get the language out of your way and usually comes at no cost :)

If you do not like the aesthetics of C# and find Elixir or the OCaml family tolerable, perhaps try F#? If you use task CEs there you end up with roughly the same performance profile and get access to a huge ecosystem, making it one of the few FP languages that can be used in production with minimal risk.

jonathanstrange

8 hours ago

No professional Go programmer would spawn 1M goroutines unless they're sure they have the memory for it (and even then, only if benchmarks indicate it, which is unlikely). Goroutines have a static stack overhead of between 2KiB and 8KiB depending on the platform. You'd use a work stealing approach with a reasonable number of goroutines instead. How many are reasonable needs to be tested because it depends on how long each goroutine spends waiting for I/O or sleeping.

But I can go further than that: No professional programmer should run 1M concurrent tasks on an ordinary CPU no matter which language, because it makes no sense if the CPU has several orders of magnitude fewer cores. The tasks are not going to run in parallel anyway.

OutOfHere

7 hours ago

The basis for running 1 million concurrent tasks is to support 1 million active concurrent user connections. They don't need to run in parallel if async is used. As shown, Rust and C# do well. How would you support it in Go?

jonathanstrange

4 hours ago

The servers I use have limits far below 1M active connections, realistically speaking about 60k simultaneously active connections. So I can't really answer that question. However, it's easy to find answers to that question online [1]. Go is not forcing you to spawn Goroutines when you don't really need them. As I said, the correct way in Go is to use worker pools, the size of which depends on measurable performance because it is connected to how much i/o each Goroutine performs and how long it waits on average.

[1] https://www.freecodecamp.org/news/million-websockets-and-go-...

YetAnotherNick

11 hours ago

The fundamental problem is that there are two kinds of sleep function. One that actually sleeps, and another that is actually a timer that just calls a certain callback after a desired interval. Promise is just syntactic sugar on top of the second type. Go certainly could call another function after a desired interval using `Timer`.

I think better comparison would be wasting CPU for 10 seconds instead of sleep.

SPascareli13

13 hours ago

As far as I know there is no way to do Promise-like async in Go; you HAVE to create a goroutine for each concurrent async task. If this is really the case then I believe the submission is valid.

But I do think that spawning a goroutine just to do a non-blocking task and get its return is kinda wasteful.

n2d4

12 hours ago

You could in theory create your own event loop and then get the exact same behaviour as Promises in Go, but you probably shouldn't. Goroutines are the way to do this in Go, and it wouldn't be useful to benchmark code that would never be written in real life.
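As a rough illustration of what "your own event loop" could mean here, the sketch below keeps pending callbacks in a min-heap ordered by deadline and fires them from a single goroutine. All of the names (`timerEntry`, `timerHeap`, `loop`, `after`, `run`) are made up for this example; only `container/heap` and `time` come from the standard library.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// timerEntry pairs a deadline with the callback to run once it has passed.
type timerEntry struct {
	at time.Time
	fn func()
}

// timerHeap is a min-heap of entries ordered by deadline.
type timerHeap []timerEntry

func (h timerHeap) Len() int           { return len(h) }
func (h timerHeap) Less(i, j int) bool { return h[i].at.Before(h[j].at) }
func (h timerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *timerHeap) Push(x any)        { *h = append(*h, x.(timerEntry)) }
func (h *timerHeap) Pop() any {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// loop is a hypothetical single-goroutine event loop that only knows timers.
type loop struct{ timers timerHeap }

// after schedules fn to run once d has elapsed.
func (l *loop) after(d time.Duration, fn func()) {
	heap.Push(&l.timers, timerEntry{at: time.Now().Add(d), fn: fn})
}

// run sleeps until the earliest deadline, fires that callback, and repeats
// until no timers remain. It returns the number of callbacks fired.
func (l *loop) run() int {
	fired := 0
	for l.timers.Len() > 0 {
		next := heap.Pop(&l.timers).(timerEntry)
		if wait := time.Until(next.at); wait > 0 {
			time.Sleep(wait)
		}
		next.fn()
		fired++
	}
	return fired
}

func main() {
	var l loop
	for i := 0; i < 1000; i++ {
		l.after(100*time.Millisecond, func() {})
	}
	fmt.Println("callbacks fired:", l.run())
}
```

Each pending "task" here costs one heap entry rather than a goroutine stack, which is the trade-off being discussed, but callbacks run one at a time and a slow callback delays everything behind it.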

peterhon

11 hours ago

I guess what you can do in golang that would be very similar to the rust impl would be this (and could be helpful even in real life, if all you need is a whole lot of timers):

  // test2 creates count timers up front, then waits for each one to fire.
  func test2(count int) {
    timers := make([]*time.Timer, count)
    for idx := range timers {
      timers[idx] = time.NewTimer(10 * time.Second)
    }
    for idx := range timers {
      <-timers[idx].C
    }
  }

This yields 263552 for "Maximum resident set size (kbytes)" according to /usr/bin/time -v

I'm not sure if I missed it, but I don't see the benchmark specify how the memory was measured, so I assumed the time -v.
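If you want a number from inside the process rather than from /usr/bin/time, something like the following sketch is one option. The helper `measure` is hypothetical; `runtime.ReadMemStats` and the `Sys` and `StackInuse` fields are standard. `Sys` counts memory the Go runtime obtained from the OS, which is related to but not the same thing as maximum resident set size, so the figures will not match exactly.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// measure (hypothetical helper) starts n goroutines that each sleep for d,
// samples the runtime's memory statistics, then waits for them to finish.
// The sample catches the goroutines asleep only if d is long compared with
// the time it takes to start them all.
func measure(n int, d time.Duration) (sys, stack uint64) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(d)
		}()
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	wg.Wait()
	return m.Sys, m.StackInuse
}

func main() {
	sys, stack := measure(100_000, 2*time.Second)
	fmt.Printf("Sys: %.1f MiB, StackInuse: %.1f MiB\n",
		float64(sys)/(1<<20), float64(stack)/(1<<20))
}
```

Comparing `StackInuse` against `Sys` gives a rough idea of how much of the total is goroutine stacks as opposed to heap and runtime bookkeeping.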

threeseed

12 hours ago

The requirement is to run 1 million concurrent tasks.

Of course each language will have a different way of achieving this task each of which will have their unique pros/cons. That's why we have these different languages to begin with.

jakewins

12 hours ago

The accounting here is weird though; Go isn’t using that RAM, it’s expecting the application to. The reason that doesn’t happen is because it’s a micro benchmark that produces no useful work.

The way the results are presented a reader may think the Go memory usage sounds equivalent to the others - boilerplate, ticket-to-play - and then the Go usage sounds super high.

But they are not the same; that memory is in anticipation of a real world program using it

Aeolun

11 hours ago

Isn’t that kind of dumb when none of the other languages do this? Apparently allocating memory is really fast? Maybe we should change the test to load 1MB of data in every task?

melodyogonna

11 hours ago

Most of those languages (excepting Java virtual threads) use stackless coroutines. Go uses stackful coroutines, which allocate some memory upfront for a goroutine to use.

gpderetta

10 hours ago

Then it is fair to compare the memory usage of a stackful coroutine to a stackless one, as they are the idiomatic way to perform async tasks in each language.

jakewins

9 hours ago

I mean this is subjective, but as long as it’s clear that one number is “this is the memory the runtime itself consumes to solve this problem” and the other number is “this is the runtime memory use and it includes pre-allocated stack space that a real application would then use”, sure

Point being: Someone reading this to choose which runtime will fit their use case needs to be careful not to assume the numbers measure the same thing. For some real world use cases the pre-allocated stack will perform better than the runtimes that instead will do heap allocations.

gpderetta

8 hours ago

Of course, as any microbenchmark, the bare results are useless. The numbers can be interesting only if you take the time to understand the implications.

perryizgr8

11 hours ago

> Apparently allocating memory is really fast?

Apart from i/o, allocating memory is usually the slowest thing you can do on a computer, in my experience.

lmm

11 hours ago

> The requirement is to run 1 million concurrent tasks.

That's not a real requirement though. No business actually needs to run 1 million concurrent tasks with no concern for what's in them.

OutOfHere

7 hours ago

If you want to support 1 million concurrent active users, you can need it.

lmm

3 hours ago

Maybe. But in that case you will need to do something for each of those users, and which languages are good at that might look quite different from this benchmark.

gleenn

11 hours ago

Also, for Java, Virtual Threads are a very new feature (Java 21 IIRC or somewhere around there). OS threads have been around for decades. As a heavy JVM user it would have been nice to actually see those both broken out to compare as well!

xargon7

9 hours ago

There's a difference between "running a task that waits for 10 seconds" and "scheduling a wakeup in 10 seconds".

The code for several of the low-memory-usage languages does the second, while the high-memory-usage results do the first. For example, on my machine the article's Go code uses 2.5GB of memory but the following code uses only 124MB. That difference is in line with the Rust results.

  package main
  
  import (
    "os"
    "strconv"
    "sync"
    "time"
  )
  
  func main() {
    numRoutines, _ := strconv.Atoi(os.Args[1])
    var wg sync.WaitGroup
    for i := 0; i < numRoutines; i++ {
      wg.Add(1)
      time.AfterFunc(10*time.Second, wg.Done)
    }
    wg.Wait()
  }

mrighele

8 hours ago

I agree with you. Even something as simple as a loop like (pseudocode)

for (n=0;n<10;n++) { sleep(1 second); }

Changes the results quite a bit: for some reason Java uses a _lot_ more memory and takes longer (~20 seconds), C# uses more than 1GB of memory, while Python struggles with just scheduling all those tasks and takes more than one minute (besides taking more memory). Node.js seems unfazed by this change.

I think this would be a more reasonable benchmark

neonsunset

7 hours ago

Indeed, looping over a Task.Delay likely causes a lot of churn in timer queues - that's 10M timers allocated and scheduled! If it is replaced with 'PeriodicTimer', the end result becomes more reasonable.

This (AOT-compiled) F# implementation peaks at 566 MB with WKS GC and 509 MB with SRV GC:

    open System
    open System.Threading
    open System.Threading.Tasks

    let argv = Environment.GetCommandLineArgs()

    [1..int argv[1]]
    |> Seq.map (fun _ ->
        task {
            let timer = PeriodicTimer(TimeSpan.FromSeconds 1.0)
            let mutable count = 10
            while! timer.WaitForNextTickAsync() do
                count <- count - 1
                if count = 0 then timer.Dispose()
        } :> Task)
    |> Task.WaitAll

To Go's credit, it remains at a consistent 2.53 GB and consumes quite a bit less CPU.

We're really spoiled with choice these days in compiled languages. It takes 1M coroutines to push the runtime and even at 100k the impact is easily tolerable, which is far more than regular applications would see. At 100K .NET consumes ~57 MB and Go consumes ~264 MB (and wins at CPU by up to 2x).

neonsunset

9 hours ago

Spawning a periodically waking up Task in .NET (say every 250ms) that performs work like sending out a network request would retain comparable memory usage (in terms of async overhead itself).

Even at 100k tasks the bottleneck is going to be the network stack (sending outgoing 400k RPS takes a lot of CPU and syscall overhead, even with SocketAsyncEngine!).

Doing so in Go would require either spawning Goroutines, or performing scheduling by hand or through some form of aggregation over channel readers. Something that Tasks make immediately available.

The concurrency primitive overhead becomes more important if you want to quickly interleave multiple operations at once. In .NET you simply do not await them at callsite until you need their result later - this post showcases how low the overhead of doing so is.

piterrro

10 hours ago

I don't know what's a fair way to do this for all languages listed in the benchmark, but for Go vs Node the only fair way would be to use a single goroutine to schedule timers and another one to pick them up when they tick, this way we don't create a huge stack and it's much more comparable to what you're really doing in Node.

Consider the following code:

  package main

  import (
      "os"
      "strconv"
      "time"
  )

  func main() {
      numTimers, _ := strconv.Atoi(os.Args[1])

      timerChan := make(chan struct{})

      // Goroutine 1: schedule timers.
      // Note: each timer still gets its own goroutine that waits on t.C.
      go func() {
          for i := 0; i < numTimers; i++ {
              timer := time.NewTimer(10 * time.Second)
              go func(t *time.Timer) {
                  <-t.C
                  timerChan <- struct{}{}
              }(timer)
          }
      }()

      // Main goroutine: receive and count timer signals.
      for i := 0; i < numTimers; i++ {
          <-timerChan
      }
  }

Also for Node it's weird not to have Bun and Deno included. I suppose you can have other runtimes for other languages too.

In the end I think this benchmark is comparing different things and not really useful for anything...

theamk

13 hours ago

> high number of concurrent tasks can consume a significant amount of memory

Note the absolute numbers here: in the worst case, 1M tasks consumed 2.7 GB of RAM, with ~2700 bytes overhead per task. That'd still fit in the cheapest server with room to spare.

My conclusion would be opposite: as long as per-task data is more than a few KB, the memory overhead of task scheduler is negligible.

pkulak

11 hours ago

Except it’s more than that. Go and Java maintain a stack for every virtual thread. They are clever about it, but it’s very possible that doing anything more than a sleep would have blown up memory on those two systems.

bilbo0s

10 hours ago

I have a sneaking suspicion that if you do anything other than the sleep during these 1 million tasks, you'll blow up memory on all of these systems.

That's kind of the Achilles' heel of the benchmark. Any business needing to spawn 1 million tasks certainly wants to do something on them. It's the "do something on them" part that usually leads to difficulties for these things. Not really the "spawn a million tasks" part.

vlovich123

2 hours ago

The “do something” OP is referring to is simple things like a deeply nested set of function calls and on stack data structures allocated and freed before you sleep. This increases the size of the stack that Go needs to save. By comparison stackless coroutines only save enough information for the continuation, no more no less. That’s going to be strictly smaller than saving the entire stack. The argument you seem to be making is that that could be the same size as the stack (eg heap allocations) but I think that’s being unreasonably optimistic. It should always end up being strictly smaller.

cperciva

13 hours ago

This depends a lot on how you define "concurrent tasks", but the article provides a definition:

> Let's launch N concurrent tasks, where each task waits for 10 seconds and then the program exits after all tasks finish. The number of tasks is controlled by the command line argument.

Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.

Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.
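The arithmetic behind those figures can be checked with a small sketch. The `task` struct below is a hypothetical layout (a deadline, a raw function pointer, and a context pointer), not something any of the benchmarked runtimes is known to use; on a 64-bit platform each field is 8 bytes, which gives the 24 bytes mentioned above.

```go
package main

import (
	"fmt"
	"unsafe"
)

// task is a hypothetical minimal record for one pending "task".
type task struct {
	deadline int64          // wake-up time, e.g. nanoseconds since some epoch
	fn       uintptr        // stand-in for a raw function pointer
	ctx      unsafe.Pointer // opaque context handed to the continuation
}

// totalMB returns the memory needed for n records of perTask bytes each,
// in megabytes (10^6 bytes).
func totalMB(n, perTask int) float64 {
	return float64(n) * float64(perTask) / 1e6
}

func main() {
	size := int(unsafe.Sizeof(task{}))
	doubled := totalMB(1_000_000, 2*size) // allowance for allocator and timer structure
	fmt.Println("bytes per task:", size)
	fmt.Println("1M bare records (MB):", totalMB(1_000_000, size))
	fmt.Println("1M records, doubled for overhead (MB):", doubled)
	fmt.Println("ratio of ~200 MB to the doubled estimate:", 200/doubled)
}
```

With 24-byte records, a million tasks need about 24 MB, or about 48 MB with the doubled allowance, and roughly 200 MB is about four times that.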

I have no idea what Go is doing to get 2500 bytes per task.

masklinn

12 hours ago

> I have no idea what Go is doing to get 2500 bytes per task.

TFA creates a goroutine (green thread) for each task (using a waitgroup to synchronise them). IIRC goroutines default to 2k stacks, so that’s about right.

One could argue it’s not fair and it should be timers which would be much lighter. There’s no “efficient wait” for them but that’s essentially the same as the appendix rust program.

jakewins

12 hours ago

Fair or not, it’s a strange way to count - Go isn’t using that RAM. It’s preallocating it because any real world program will.

masklinn

11 hours ago

Go is absolutely using that ram, it’s not available to other services on the system.

xarope

9 hours ago

The argument then, is what if we DO load 2K worth [0] of randomized data into each of those 1m goroutines (and equivalents in the other languages), and do some actual processing. Would we still see the equivalent 10x (whatever math works it out to be) memory "bloat"? And what about performance?

We, as devs, have "4" such resources available to us, memory, network, I/O and compute. And it behooves us to not prematurely optimize on just one.

[0] I can see more arguments/discussions now, "2K is too low, it should be 2MB" etc...!

masklinn

9 hours ago

So the argument is “if you measure something completely different from and unrelated to the article you do not get the same result”?

I guess that’s true.

And to be clear, I do agree with the top comment (which seems to be by you): TFA uses timers in the other runtimes and Go does have timers, so using goroutines is unwarranted and unfair. And I said as much a few comments up (although I’d forgotten about AfterFunc, so I’d have looped and waited on `time.After`, which would still have been a pessimisation).

And after thinking more about it, the article is also outright lying: technically it’s only measuring tasks in Go. Timers are futures / awaitables but they’re not tasks: they’re not independently scheduled units of work, and are pretty much always special cased by runtimes.

jakewins

11 hours ago

You know what I mean. If this was a real world program where those million tasks actually performed work, then this stack space is available for the application to do that work.

It’s not memory that’s consumed by the runtime, it’s memory the runtime expects the program to use - it’s just that this program does no useful work.

aeturnum

10 hours ago

I am not u/masklinn - but I don't know what you mean. Doesn't the runtime consume memory by setting it aside for future use? Like what else does "using" ram mean other than claiming it for a time?

jakewins

9 hours ago

If the example was extended to, say, once the sleep is completed then parse and process some JSON data (simulating the sleep being a wait on some remote service), then how would memory use be affected?

In the Go number reported, the majority of the memory is the stack Go allocated for the application code anticipating processing to happen. In the Node example, the processing instead will need heap allocation.

Point being that the two numbers are different - one measures just the overhead of the runtime, the other adds the memory reserved for the app to do work.

The result then looks wasteful for Go because the benchmark.. doesn’t do anything. In a real app though, preallocating stack can often be faster than doing just-in-time heap allocation.

Not always of course! Just noting that the numbers are different things; one is runtime cost, one is runtime cost plus an optimization that assumes memory will be needed for processing after the sleep.

hatefulmoron

9 hours ago

I think he means that if the Go code had done something more useful, it would use about the same amount of memory. Compare that to another implementation, which might allocate nearly no memory when the tasks don't do anything significant but would quickly catch up to Go if they did.

flockonus

10 hours ago

I'll concede it would be interesting to have a similar benchmark where, instead of sleeping (which by itself is indeed nonsense), each task computes a small Fibonacci sequence or writes a small file; something like that.

winternewt

11 hours ago

Please elaborate. If each stack is 2KB then surely all of that virtual memory is committed to physical RAM, and hence is using actual memory?

scrapheap

10 hours ago

Yes and no.

If that memory isn't being used and other things need the memory then the OS will very quickly dump it into swap, and as it's never being touched the OS will never need to bring it back in to physical memory. So while it's allocated it doesn't tie up the physical RAM.

cperciva

11 hours ago

Aha, 2k stacks. I figured that stacks would be page size (or more) so 2500 seemed both too small for the thread to have a stack and too large for it to not have a stack.

2k stacks are an interesting design choice though... presumably they're packed, in which case stack overflow is a serious concern. Most threading systems will do something like allocating a single page for the stack but reserving 31 guard pages in case it needs to grow.

masklinn

11 hours ago

Goroutines being go structures, the runtime can cooperate with itself so it doesn't need to do any sort of probing: function prologues can check if there's enough stack space for its frame, and grow the stack if not.

In reality it does use a guard area (technically I think it's more of a redzone? It doesn't cause access errors and functions with known small static frames can use it without checking).

uluyol

11 hours ago

Go stacks are dynamically copied and resized. Stack overflow is not a concern.
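A quick way to see this from user code is to recurse far past the small initial stack on a fresh goroutine. The sketch below uses made-up helper names (`depth`, `runDeep`) and shows only that the growth is transparent to the program, not how the runtime implements it.

```go
package main

import "fmt"

// depth recurses n levels, keeping a 1 KiB array alive in every frame so the
// goroutine's stack has to grow well beyond its small initial size.
// It returns n+1: one for each level, including level 0.
func depth(n int) int {
	var pad [1024]byte
	pad[n%len(pad)] = 1
	if n == 0 {
		return int(pad[0])
	}
	return depth(n-1) + int(pad[n%len(pad)])
}

// runDeep runs the recursion on a new goroutine and returns its result.
func runDeep(n int) int {
	out := make(chan int)
	go func() { out <- depth(n) }()
	return <-out
}

func main() {
	// Roughly 10 MiB of frames on a goroutine that started with a few KiB.
	fmt.Println("result:", runDeep(10_000))
}
```

The recursion finishes normally as long as it stays under the runtime's maximum stack size, which is far larger than the initial allocation.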

cperciva

11 hours ago

Oh yuck. Invalidating all the pointers to the stack? That's got to be expensive.

I guess if you're already doing garbage collection moving the stack doesn't make things all that much worse though... still, yuck.

masklinn

10 hours ago

Yeah it’s the drawback, originally it used segmented stacks but that has its own issues.

And it’s probably not the worst issue because deep stacks and stack pointers will mostly be relevant for long running routines which will stabilise their stack use after a while (even if some are likely subject to threshold effects if they’re at the edge, I would not be surprised if some codebases ballasted stacks ahead of time). Also because stack pointers will get promoted to the heap if they escape so the number of stack pointers is not unlimited, and the pointer has to live downwards on the stack.

YZF

11 hours ago

A goroutine stack can grow. (EDIT: With stack copying AFAICT... so no virtual pages reserved for a stack to grow... probably some reason for this design?)

Mawr

11 hours ago

> Now Go loses by over 13 times to the winner. It also loses by over 2 times to Java, which contradicts the general perception of the JVM being a memory hog and Go being lightweight.

Well, if it isn't the classic unwavering confidence that an artificial "hello world"-like benchmark is in any way representative of real world programs.

phillipcarter

an hour ago

Yes, but also, languages like Java and C# have caught up a great deal over the past 10 years and run incredibly smoothly. Most people's perception of them being slow is really just from legacy tech that they encountered a long time ago, or (oof) being exposed to some terrible piece of .NET Framework code that's still running on an underprovisioned IIS server.

blixt

10 hours ago

While it’s nice to compare languages with simple idiomatic code I think it’s unfair to developers to show them the performance of an entirely empty function body and graphs with bars that focus on only one variable. It paints a picture that you can safely pick language X because it had the smaller bar.

I urge anyone making decisions from looking at these graphs to run this benchmark themselves and add two things:

- Add at least the most minimal real world task inside of these function bodies to get a better feel for how the languages use memory

- Measure the duration in addition to the memory to get a feel for the difference in scheduling between the languages
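A minimal sketch of those two additions, in Python's asyncio: each task does a small piece of work before sleeping, and the run reports wall-clock duration alongside peak traced memory. The function names, the SHA-256 payload, and the use of `tracemalloc` (which only counts Python-level allocations, not the process's total RSS) are assumptions for illustration, not part of the original benchmark.

```python
import asyncio
import hashlib
import time
import tracemalloc


async def task_with_work(task_id: int, delay: float) -> str:
    # Hypothetical "minimal real-world task": hash a small payload, then wait.
    digest = hashlib.sha256(str(task_id).encode()).hexdigest()
    await asyncio.sleep(delay)
    return digest


async def run_benchmark(num_tasks: int, delay: float) -> tuple[float, int]:
    # Returns (elapsed seconds, peak bytes traced by tracemalloc).
    tracemalloc.start()
    started = time.perf_counter()
    await asyncio.gather(*(task_with_work(i, delay) for i in range(num_tasks)))
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_bytes


if __name__ == "__main__":
    elapsed, peak = asyncio.run(run_benchmark(10_000, 1.0))
    print(f"duration: {elapsed:.2f}s, peak traced memory: {peak / 1e6:.1f} MB")
```

Reporting both numbers side by side makes it harder to read a single bar as the whole story.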

tossandthrow

10 hours ago

This urge is as old as statistics. And I dare to say that most people after reading the article in question are well prepared to use the results for what they are.

blixt

10 hours ago

I can’t say I share your optimism. I’ve seen plenty of developers point to graphs like these as a reason for why they picked a language or framework for a problem. And it comes down to how good a proxy the benchmark actually is for such problems. I just hope that with enough feedback the author would consider making the benchmark more nuanced to paint a picture of why these differences in languages exist (as opposed to saying which languages “lose” or “win”).

sfn42

7 hours ago

And by use the results for what they are you mean ignore them because they are completely useless?

JyB

9 hours ago

I’m still baffled that some people are bold enough to voluntarily post these kinds of most-of-the-time useless “benchmarks” that will inevitably be riddled with errors. I don’t know what pushes them. In the end you look like a clown more often than not.

wiseowise

8 hours ago

The fastest way to learn the truth is by posting the wrong thing on the internet, or something.

enginoid

8 hours ago

Trying things casually out of curiosity isn’t harmful. I expect people understand that these kinds of blog posts aren’t rigorous science to draw foundational conclusions from.

And the errors are a feature — I learn the most from the errata!

polyrand

3 hours ago

Out of curiosity, I checked if using uvloop[0] in Python changed the numbers.

This is the code:

  # /// script
  # requires-python = ">=3.12"
  # dependencies = ["uvloop"]
  # ///
  
  import asyncio
  import sys
  
  import uvloop
  
  
  async def main(num_tasks):
      tasks = []
  
      for task_id in range(num_tasks):
          tasks.append(asyncio.sleep(10))
  
      await asyncio.gather(*tasks)
  
  
  if __name__ == "__main__":
      num_tasks = int(sys.argv[1])
      # uvloop.run(main(num_tasks))
      asyncio.run(main(num_tasks))
I ran it with 100k tasks:

  /usr/bin/time -l -p -h uv run async-memory.py 100000
On my M1 MacBook Pro, using asyncio reports (~170MB):

  170835968  maximum resident set size
Using uvloop (~204MB):

  204259328  maximum resident set size

I kept the `import uvloop` statement when just using asyncio so that both cases start in the same conditions.

[0]: https://github.com/MagicStack/uvloop/
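For anyone who wants the same peak figure from inside the script rather than from `/usr/bin/time`, here is a small sketch using the standard `resource` module (POSIX only). The helper name `peak_rss_bytes` is made up for illustration, and it measures only the Python process itself, so it will not necessarily match a number taken around a wrapper command.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Return this process's peak resident set size in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


if __name__ == "__main__":
    print(f"{peak_rss_bytes()}  maximum resident set size")
```

Calling it at the end of `main()` would give a per-run number without depending on the flags of a particular `time` implementation.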

aba_cz

11 hours ago

Regarding Java, I'm pretty sure that benchmark is at least a little bit broken and is testing something else: not specifying an initial size for the ArrayList means a list of capacity 10, which gets resized repeatedly as `add()` is called, leaving a large number of unused objects for the garbage collector.

brabel

10 hours ago

Yeah that is a junior mistake... They should've pre-sized the ArrayList, or better, used an array because that's more memory efficient (and I would say would be what any decent dev would do when the size of tasks is known beforehand).

> Some folks pointed out that in Rust (tokio) it can use a loop iterating over the Vec instead of join_all to avoid the resize to the list

Right, but some folks also pointed out you should've used an array in Java in the previous blog post, 2 years ago, and you didn't do that.

And folks also pointed out Elixir shouldn't have used Task in the previous benchmark (folk here being the creator of Elixir himself): https://github.com/pkolaczk/async-runtimes-benchmarks/pull/7

sfn42

10 hours ago

The difference between an arraylist with correct initial size and an array is almost nothing. Arraylist itself is just a wrapper around an array.

brabel

6 hours ago

It can be a big difference if boxing is involved. Or if the list is very big, because every access to an item in the list requires a cast at the bytecode level (due to type erasure).

bekantan

10 hours ago

It would indeed be better to create appropriately sized storage.

However, I don't think that underlying array is resized every time `add` is called. I'd expect that resize will happen less than 30 times for 1M adds (capacity grows geometrically with a=10 and r=1.5)
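That estimate can be checked with a few lines of arithmetic. This is a sketch of the growth model described above (start at 10, grow by about 1.5x each time), not a reading of the actual ArrayList source; the integer rule `new = old + old // 2` is an assumption.

```python
def count_resizes(target_size: int, initial_capacity: int = 10) -> tuple[int, int]:
    """Simulate ~1.5x growth until the capacity covers target_size.

    Returns (number of resizes, final capacity).
    """
    capacity = initial_capacity
    resizes = 0
    while capacity < target_size:
        # Assumed growth rule: new = old + old // 2 (integer arithmetic).
        capacity += capacity // 2
        resizes += 1
    return resizes, capacity


if __name__ == "__main__":
    resizes, capacity = count_resizes(1_000_000)
    print(f"{resizes} resizes, final capacity {capacity}")
```

Under those assumptions the count stays below 30 for 1M adds, so the cost is a few dozen array copies rather than one per `add()`.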

jeswin

10 hours ago

Good to see NativeAOT getting positive press.

Go won because it served a need felt by many programmers: a garbage-collected language which compiled to native code, with robust libraries supported by a large corp.

With Native AOT, C# is walking into the same space. With arguably better library selection, equivalent performance, and native code compilation. And a much more powerful, well-thought-out language - at a slight complexity cost. If you're starting a project today (with the luxury of choosing a language), you should give C# + NativeAOT a consideration.

sfn42

7 hours ago

C# is my daily driver and I'd use it for almost anything, great language. However I think "slight complexity cost" is an understatement. It's a very complex language by my standards, and they keep adding more stuff. A lot of it is just syntax sugar to do the same things in a different way, like primary constructors.

It's nice to have that stuff when you know the language, but it does make the learning curve steeper and it can be a bit annoying when working in a team.

Even after 4 years of using it professionally I still see code some times that uses obscure syntax I had no idea existed. I would describe C# as a language for experts. If you know what you're doing it's an amazing language, maybe actually the best current programming language. But learning and understanding everything is a monumental task, simpler languages like go or Java can be learned much faster.

octacat

7 hours ago

Where is erlang? Sleeping is not running, by the way. If you just sleep, in Erlang you would use a hibernated process.

I feel this is so misleading. For example, by default after spawning, Erlang would have some memory preallocated for each process, so they don't need to ask the operating system for new allocations (and if you want to shrink it, you call hibernate).

Do something more real, like message passing with one million processes or websockets. Or 1M tcp connections. Because, the moment you send messages, here is when the magic happens (and memory would grow, the delay when each message is processed would be different in different languages).

Oh, and btw, if you want to do THAT in erlang, use timer:apply_after(Time, Module, Function, Arguments). Which would not spawn an erlang process, just would put the task to the timer scheduling table.

And Elixir was in the old article, and they implemented it all wrong. Sad.

jillesvangurp

10 hours ago

Did a similar benchmark in Kotlin using co-routines.

    import kotlin.time.Duration.Companion.milliseconds
    import kotlin.time.measureTime
    import kotlinx.coroutines.async
    import kotlinx.coroutines.awaitAll
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.delay
    
    suspend fun main() {
        measureTime {
            coroutineScope {
                (0..1000000).map {
                    async {
                        delay(1.milliseconds)
                    }
                }.awaitAll()
            }
        }.let { t ->
            println("Took $t")
            val runtime = Runtime.getRuntime()
    
            val maxHeapSize = runtime.maxMemory() 
            val allocatedHeapSize = runtime.totalMemory()
            val freeHeapSize = runtime.freeMemory()
    
            println("Max Heap: ${maxHeapSize / 1024 / 1024} MB")
            println("Allocated Heap: ${allocatedHeapSize / 1024 / 1024} MB")
            println("Free Heap: ${freeHeapSize / 1024 / 1024} MB")
        }
    }
This produces the following output:

   Took 1.597011084s
   Max Heap: 4096 MB
   Allocated Heap: 2238 MB
   Free Heap: 1548 MB
So whatever is needed to load classes and a million co-routines with some heap state. Of course the whole thing isn't doing any work and this isn't much of a benchmark. And of course if I run it with kotlin-js it actually ends up using promises. So, it's not going to be any better there than on the JVM.

promiseofbeans

12 hours ago

It would be nice if the author also compared different runtimes (e.g. NodeJS vs Deno, or cpython vs pypy) and core language engines (e.g. v8 vs spider monkey vs JavaScript core)

davidatbu

13 hours ago

I write (async) Rust regularly, and I don't understand how the version in the appendix doesn't take 10x1,000,000 seconds to complete. In other words, I'd have expected no concurrency to take place.

Am I wrong?

UPDATE: From the replies below, it looks like I was right about "no concurrency takes place", but I was wrong about how long it takes, because `tokio::time::sleep()` keeps track of when the future was created, (ie when `sleep()` was called) instead of when the future is first `.await`ed (which was my unsaid assumption).

claytonwramsey

13 hours ago

The implementation of `sleep` [1] decides the wake up time by when `sleep` is called, rather than when its future is polled. So the first task waits one second, then the remaining tasks see that they have already passed the wake-up time and so return instantly.

[1]: https://docs.rs/tokio/latest/tokio/time/fn.sleep.html
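To make the mechanism concrete, here is a sketch of the same idea in Python's asyncio rather than Tokio. `DeadlineSleep` is a hypothetical awaitable written for this illustration: it fixes its wake-up time when it is constructed, so awaiting many of them one after another takes roughly one sleep duration instead of the sum.

```python
import asyncio
import time


class DeadlineSleep:
    """Hypothetical awaitable whose deadline is fixed at construction time."""

    def __init__(self, seconds: float) -> None:
        self.deadline = time.monotonic() + seconds

    def __await__(self):
        # Sleep only for whatever time is left until the stored deadline.
        remaining = self.deadline - time.monotonic()
        return asyncio.sleep(max(0.0, remaining)).__await__()


async def await_sequentially(n: int, seconds: float) -> float:
    sleeps = [DeadlineSleep(seconds) for _ in range(n)]
    started = time.monotonic()
    for s in sleeps:
        await s  # after the first one, the deadline has already passed
    return time.monotonic() - started


if __name__ == "__main__":
    print(f"elapsed: {asyncio.run(await_sequentially(1000, 1.0)):.2f}s")
```

Note that plain `asyncio.sleep` does not behave this way: its timer starts when the coroutine is first awaited.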

vbsd

11 hours ago

> because `tokio::time::sleep()` keeps track of when the future was created, (ie when `sleep()` was called) instead of when the future is first `.await`ed

I’m not a Rust programmer but I strongly suspect this updated explanation is erroneous. It’s probably more like this: start time is recorded when the task execution is started. However, the task immediately yields control back to the async loop. Then the async loop starts another task, and so on. It’s just that the async loop only returns control to the sleeping task no earlier than the moment 1s passes after the task execution was initially started. I’d be surprised if it had anything to do with when sleep() was called.

davidatbu

9 hours ago

Someone linked the code in another comment, and the start time is most definitely recorded when the future is created: https://docs.rs/tokio/1.41.1/src/tokio/time/sleep.rs.html#12...

vbsd

3 hours ago

Huh, you're right about this, thanks.

On the other hand, I maintain that this is an incidental rather than essential reason for the program finishing quickly. In that benchmark code, we can replace "sleep" with our custom sleep function which does not record start time before execution:

  use std::time::Duration;
  use tokio::time::sleep;

  async fn wrapped_sleep(d: Duration) {
      sleep(d).await
  }

The following program will still finish in ~10 seconds.

  #[tokio::main]
  async fn main() {
      let num_tasks = 100;
      let mut tasks = Vec::new();
      for _ in 0..num_tasks {
          tasks.push(wrapped_sleep(Duration::from_secs(10)));
      }
      futures::future::join_all(tasks).await;
  }

ch33zer

13 hours ago

Tokio::sleep is async

davidatbu

13 hours ago

I think the points people made in other replies make sense, but "Tokio::sleep is async" by itself is not enough of an explanation. If it were the case that `Tokio::sleep()` tracked the moment `.await` was called as its start time, I believe it would indeed take 10x1,000,000 seconds, _even if it's async_.

joshka

8 hours ago

RUST

The Rust code is really checking how big Tokio's structures that track timers are. Solving the problem in a fully degenerate manner, the following code runs correctly and uses only 35MB peak. 35 bytes per future seems pretty small. 1 billion futures was ~14GB and ran fine.

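    use std::future::Future;
    use std::iter;
    use std::pin::Pin;
    use std::task::{Context, Poll};
    use std::time::{Duration, Instant};

    #[tokio::main]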
    async fn main() {
        let sleep = SleepUntil {
            end: Instant::now() + Duration::from_secs(10),
        };
        let timers: Vec<_> = iter::repeat_n(sleep, 1_000_000_0).collect();
        for sleep in timers {
            sleep.await;
        }
    }

    #[derive(Clone)]
    struct SleepUntil {
        end: Instant,
    }

    impl Future for SleepUntil {
        type Output = ();

        fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
            if Instant::now() >= self.end {
                Poll::Ready(())
            } else {
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
Note: I do understand why this isn't good code, and why it solves a subtly different problem than posed (the sleep is cloned, including the deadline, so every timer is the same).

The point I'm making here is that synthetic benchmarks often measure something which doesn't help much. While the above is really degenerate, it shares the same problems as the article's code (it just leans into problems much harder).

afavour

11 hours ago

Maybe I’m missing something here but surely Node isn’t doing anything concurrently? Promises don’t execute concurrently, they just tidy up async execution. The code as given will just sequentially resolve a million promises. No wonder it looks so good. You’d need to be using workers to actually do anything concurrently.

charlotte-fyi

11 hours ago

That's not entirely true. There's a thread pool of workers underneath libuv. Tasks that would block do indeed execute concurrently.

afavour

11 hours ago

Oh, I know. But the code used in this test doesn’t utilize that thread pool at all. It just uses setTimeout.

Izkata

11 hours ago

You're thinking of parallelism. Concurrency doesn't require them to actually be running at the same time.
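A quick way to see the distinction, sketched in Python's asyncio as a stand-in for the Node event loop (the function names are made up for this example): many sleeping tasks are in flight at once, yet every one of them resumes on the same OS thread.

```python
import asyncio
import threading


async def record_thread(delay: float) -> int:
    await asyncio.sleep(delay)
    # Report which OS thread resumed this task.
    return threading.get_ident()


async def threads_used(num_tasks: int, delay: float) -> set[int]:
    idents = await asyncio.gather(*(record_thread(delay) for _ in range(num_tasks)))
    return set(idents)


if __name__ == "__main__":
    used = asyncio.run(threads_used(10_000, 1.0))
    print(f"{len(used)} thread(s) ran all tasks")
```

All tasks overlap in time (concurrency) while only one thread ever executes them (no parallelism).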

afavour

11 hours ago

Fair point. Kind of makes the comparisons between languages a little unfair, though. Go and Rust would be executing these operations in parallel, Node would not. Would make a significant difference to real world performance!

Izkata

11 hours ago

The measurement is memory, not performance. Paused/queued tasks in the sequential node version still count, and in theory could be worse since the Go and Rust ones would be consuming them in parallel and not building up the queue as much.

lenkite

8 hours ago

> The measurement is memory, not performance.

But, then they can measure memory by simply using a thread pool of size 1 and then submitting tasks to it, right? That would be the equivalent comparison for other languages.

They should launch a million NodeJS processes.

vrnvu

7 hours ago

Conclusion by author: > Now Go loses by over 13 times to the winner. It also loses by over 2 times to Java, which contradicts the general perception of the JVM being a memory hog and Go being lightweight.

Note that the Go and Java code are not doing the same thing! See xargon7's comment.

iforgotpassword

10 hours ago

Can someone explain the node version to me? My js knowledge is from a decade ago. AFAIK, setTimeout creates a timer and returns a handle to it. What does promisify do? I'd assume it's a general wrapper that takes a function that returns X and wraps it so that it returns Promise<X>. So that code actually runs 10k tasks that each create a timer with a timeout of 10 seconds and return immediately.

throwitaway1123

3 minutes ago

Promisify converts a callback based function into a promise returning function [1]. Functions are objects in JS, and if the function object has a `util.promisify.custom` method, promisify will simply return the `util.promisify.custom` method instead of wrapping the original function. Calling promisify on setTimeout in Node is redundant because Node already ships a built in promisified version of setTimeout. So the following is true:

  setTimeout[util.promisify.custom] === require('node:timers/promises').setTimeout

[1] https://nodejs.org/docs/latest-v22.x/api/util.html#utilpromi...

samsartor

11 hours ago

Reminder that Rust does not automatically schedule anything. Unless you _explicitly_ call `tokio::spawn` or `async_std::spawn` you are still living entirely in state-machine land.

Rust's `join_all` uses `FuturesUnordered` behind the scenes, which is pretty intelligent in terms of keeping track of which tasks are ready to make progress, but it does not use tokio/async_std for scheduling. AFAICT the only thing being measured about tokio/async_std is the heap size of their `sleep` implementations.

I'd be very interested in seeing how Tokio's actual scheduler performs. The two ways to do that are:

- using https://docs.rs/tokio/latest/tokio/task/join_set/struct.Join... to spawn all the futures and then await them

- spawn each future in the global scheduler, and then await the JoinHandles using the for loop from the appendix

As other commenters have noted, calling `sleep` only constructs a state machine. So the Appendix isn't actually concurrent. Again, you need to either put those state machines into the tokio/async_std schedulers with `spawn`, or combine the state machines with `FuturesUnordered`.
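As an analogy only (Python's asyncio, not Tokio), the difference between holding un-scheduled state machines and handing them to a scheduler looks like this. A bare coroutine does nothing until awaited, whereas `asyncio.create_task` plays roughly the role of `spawn`. One caveat: unlike Tokio's `sleep`, `asyncio.sleep` starts its timer when awaited, so the un-spawned version here really does take n times the delay.

```python
import asyncio
import time


async def await_bare_coroutines(n: int, delay: float) -> float:
    coros = [asyncio.sleep(delay) for _ in range(n)]  # nothing is scheduled yet
    started = time.perf_counter()
    for coro in coros:
        await coro  # each sleep only starts when it is awaited
    return time.perf_counter() - started


async def await_spawned_tasks(n: int, delay: float) -> float:
    started = time.perf_counter()
    # create_task hands each coroutine to the event loop right away.
    tasks = [asyncio.create_task(asyncio.sleep(delay)) for _ in range(n)]
    for task in tasks:
        await task
    return time.perf_counter() - started


if __name__ == "__main__":
    print(f"bare:    {asyncio.run(await_bare_coroutines(5, 0.2)):.2f}s")
    print(f"spawned: {asyncio.run(await_spawned_tasks(5, 0.2)):.2f}s")
```

The same `for` loop over handles is sequential in one case and concurrent in the other; what matters is whether the work was given to the scheduler beforehand.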

jhgg

11 hours ago

I did this measurement, and using time -v, the maximum resident size in KB comes out to 440,424 kb for 1m tasks, 46,820 kb for 100k, and 7,156 kb for 10k.
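Taking the slope between those data points, rather than dividing each total by its task count, cancels out the fixed runtime baseline. A small sketch of that arithmetic; it assumes the reported unit is 1024 bytes, which is an assumption about the `time` output rather than something stated above.

```python
# Figures quoted above: (number of tasks, maximum resident set size in KB).
measurements = [(10_000, 7_156), (100_000, 46_820), (1_000_000, 440_424)]


def marginal_bytes_per_task(low, high, kb_size=1024):
    """Slope between two measurements; the fixed baseline cancels out."""
    (n_low, kb_low), (n_high, kb_high) = low, high
    return (kb_high - kb_low) * kb_size / (n_high - n_low)


if __name__ == "__main__":
    for low, high in zip(measurements, measurements[1:]):
        per_task = marginal_bytes_per_task(low, high)
        print(f"{low[0]} -> {high[0]} tasks: {per_task:.0f} bytes per task")
```

Both intervals come out at roughly 450 bytes per spawned task, which suggests the per-task cost is close to linear over this range.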

pgAdmin4

13 hours ago

Why is C with pthreads missing from this benchmark?

throwaway81523

13 hours ago

I don't think 1M posix threads is a thing. 1K is no big deal though.

liontwist

12 hours ago

~100k is a thing on Linux.

thesnide

9 hours ago

that. Or just using a C coroutine lib.

martypitt

10 hours ago

Seriously impressive results from C#. I'm a JVM guy by day, and long-time admirer of C# as a language, but always assumed the two were broadly comparable performance-wise.

This is a sample of 1 usecase, (so questionable real-worldness) but the difference is really eye-opening. Congrats to the C# team!

ReptileMan

9 hours ago

Just a minor nitpick - this is the .NET Runtime vs JVM. My personal observations are that CPU wise they are close, but for reasons unknown JVM has always been more memory hoggish.

The JIT compiler that microsoft created has been nothing short of amazing.

jakobnissen

9 hours ago

Just tried this in Julia: 16.0 GB of memory for 1M tasks!

I believe each task in Julia has its own stack, so this makes sense. Still, it does mean you've got to take account of ~16 KB of memory per running task which is not great.

citrin_ru

10 hours ago

It would be more interesting to see a benchmark where a task will not be empty but would have an open network connection e.g. would make an HTTP request to a test server with 10 seconds response time. Network is a frequent reason real world applications spawn 1M tasks.

rcarmo

10 hours ago

No Erlang, though. That ought to be amazingly small for that kind of synthetic benchmark.

abdellah123

11 hours ago

Can we do something more real world at least? What's the cost (Hetzner monthly) of maintaining 1M concurrent websocket connections where each makes a query to a Postgres DB at a random interval of 1-4 seconds?

The cost wouldn't be just memory because the network card and CPU also enter the game.

win32_func

10 hours ago

The JS version can quickly be improved to use less memory (~10%).

    async function main() {
        const numTasks = parseInt(process.argv[2], 10);
        const taskDuration = 10000; // 10 seconds

        const tasks = Array.from({ length: numTasks }, () =>
            new Promise(resolve => setTimeout(resolve, taskDuration))
        );

        await Promise.all(tasks); // Wait for all tasks to resolve
        console.log("All tasks completed.");
    }

    main().catch(err => {
        console.error("Error occurred:", err);
    });

bilbo-b-baggins

12 hours ago

This benchmark is nonsense. Apart from the fact that Go has an average Goroutine overhead of a 4kB stack (meaning an average usage of 3.9GB for 1M tasks), the code written is also in a closure, and scheduling a 2nd Goroutine in the wg.Done(), so unlike some of the others it had at least 2M function calls on the event loop stack in addition to at least 1M closure references. So yeah, it’s a great example of bad code in any language.

neonsunset

12 hours ago

Here's an implementation in C# that more faithfully matches what you have to do in Go:

    var cnt = int.Parse(args[0]);
    var evt = new CountdownEvent(cnt);
    for (var i = 0; i < cnt; i++) {
        async Task Execute() {
            await Task.Delay(TimeSpan.FromSeconds(10));
            evt.Signal();
        }
        _ = Execute();
    }
    evt.Wait();
It ends up consuming roughly 264.5 MB on ARM64 macOS 15.1.1 (compiled with NativeAOT).

piokoch

9 hours ago

It seems there are just two clubs: you go with bare metal (Rust, C# native AOT) or you use some higher level abstraction (virtual machine, garbage collector) and then there is no significant difference between Java, Node, Go or Python.

For me Python worked surprisingly well, while Go was surprisingly high on memory consumption.

0xcoffee

12 hours ago

The C# version will copy the list into an array during Task.WhenAll, it may save some memory to use an array directly.

Source: https://github.com/microsoft/referencesource/blob/master/msc...

neonsunset

12 hours ago

It doesn't take that much space, and not all languages have the option to easily map an initial range onto an iterator that produces tasks. Most are dominated by the size of state machines/virtual threads.

Please note that the link above leads to old code from .NET Framework.

The up-to-date .NET implementation lives here: https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...

anonymousDan

8 hours ago

Would be interesting to see the numbers for elixir/Erlang.

lxe

13 hours ago

NodeJS does what it was designed to do well.

ankit70

13 hours ago

I wonder how compiled (using Deno or others) JS would perform.

reverseblade2

6 hours ago

It is odd that C# version doesn't increase list capacity:

List<Task> tasks = new List<Task>(numTasks);

datadeft

9 hours ago

No Erlang/Elixir?? The author would be surprised.

skyake

2 hours ago

Erlang/Elixir failed the 1M tasks benchmark that's why it was excluded.

SirGiggles

an hour ago

Most likely they did not adjust +P which bumps the max process limit

pjmlp

10 hours ago

While the idea in general is interesting, the whole site is unreadable in Firefox: the text gets all over the graphics.

mrweasel

10 hours ago

Seems fine to me.

pjmlp

10 hours ago

Strange, killing Firefox and going again did it for me.

I guess some FF bug then.

Thanks.

voodooEntity

9 hours ago

I came here to rage (I'm just being honest) because the Go code example is bad and absolutely not representative. I've been writing Go for multiple years, especially a lot of multithreading, and what is presented there as the result is just wrong. Apart from there being no necessity to use a waitgroup for threading, as many others here have already stated, even with a waitgroup you can reduce the memory significantly, down to something like 130 MB for 1 million threads.

Also some other languages seem to be misrepresented.

Seems like someone had good intentions but no idea about the languages he tried to compare and the result is this article.

wiseowise

7 hours ago

Remember, folks, none of it matters if nobody uses your software.

tzahifadida

11 hours ago

This test is not realistic. These languages work differently: some are garbage collected and others allocate and deallocate manually. If you do not configure a garbage-collected language correctly, its memory consumption will spin out of control because it will simply not garbage collect. If you had configured the garbage collection to be low for Java and Go, then Go would look like Rust.

fortran77

10 hours ago

He should have looked at Erlang.

neonsunset

10 hours ago

Erlang is likely going to have about the same or greater starting overhead as Go here from what I measured[0] with Elixir. Each Erlang process carries its own independent GC which allows it to isolate allocation impact, contributing to the robustness of its implementation. I assume this is where the cost comes from. If you do measure Erlang - please post the numbers.

Processes in Erlang, Goroutines in Go and Virtual Threads in Java do not fully replace lightweight asynchronous state machines - many small highly granular concurrent operations is their strength.

[0]: https://gist.github.com/neon-sunset/8fcc31d6853ebcde3b45dc7a... (disclaimer: as pointed out in a sibling comment it uses Elixir's Task abstraction which adds overhead on top of the processes)

bn-l

10 hours ago

123 MB of ram to run 1 million tasks in rust. That is mind blowing.

ww520

9 hours ago

How do they measure the memory usage?

liontwist

13 hours ago

No baseline against UNIX processes?

masklinn

12 hours ago

A million processes?

liontwist

12 hours ago

I've seen 100k. What happens at a million? How many is unworkable?

Izkata

10 hours ago

  $ cat /proc/sys/kernel/pid_max 
  4194304
My computer can handle that many processes, after that no new processes can be spawned (see: forkbomb)

blitzar

10 hours ago

I am left wondering what happens at 4194305

masklinn

9 hours ago

You can fork bomb your system and observe.

ankit70

13 hours ago

NodeJS is better at memory than go?

hu3

10 hours ago

They are doing different things in this benchmark.

NodeJS has one thread with a very tight loop.

Go actually spawned 1M green threads.

Honestly this benchmark is just noise. Not to say useless in most real world scenarios. Especially because each operation is doing nothing. It would be somewhat useful if they were doing some operation like a DB or HTTP call.

erik_seaberg

13 hours ago

I'd expect that because Promises are small Javascript objects while goroutines each get a stack that grows from at least 2 KB.

dboreham

13 hours ago

Otoh Go actually supports concurrency.

jpgvm

13 hours ago

Well they are all concurrent. I think what you mean is Go is also parallel. As is C#, Rust and Java in this bench.

amazingamazing

13 hours ago

Yet again nodejs surpasses my pre-read expectations 3rd best (generalized) for a million? Wow.

I must be missing something - isn’t Go supposed to be memory efficient? Perhaps promises and goroutines aren’t comparable?

brabel

10 hours ago

If you want to run 1 million coroutines that just sleep in your app, yeah nodejs looks very efficient. The problem is that when each coroutine needs to allocate memory, which I would suppose anything real would do, the 2Kb Go pre-allocates will be an advantage - as it will probably be required except for the most trivial workloads (like in this benchmark) - and then because Go actually runs them in parallel, unlike nodejs, you would likely see a huge improvement in both performance and memory usage with Go or Rust.

fuzzybear3965

13 hours ago

I'm not sure what "memory efficient" means. But, Go sprung as a competitor to Java (portability, language stability, corporate language support/development) and C++ (faster compile times). Can't beat C++ in terms of memory management (performance, guys, not safety) much. But, you can fare well against the JVM, I'm guessing.

jpgvm

13 hours ago

In this benchmark actually no, Go doesn't fare well. There is actually higher static overhead per goroutine than JVM VirtualThread. I presume this is because of a larger initial stack size though.

This probably doesn't matter in the real world as you will actually use the tasks to do some real work which should really dwarf the static overhead in almost all cases.

rwaksmunski

13 hours ago

Be sure to read the Appendix, Rust's state machine async implementation is indeed very efficient.

Comma2976

9 hours ago

I'd be ashamed to publish such shoddy "work" using any real account or name

bob_alderman

9 hours ago

good thing you didn't benchmark ruby because we probably couldn't see the other chart bars.

lowyek

12 hours ago

depends on the tasks.

neonsunset

13 hours ago

To add a data point for Elixir: https://gist.github.com/neon-sunset/8fcc31d6853ebcde3b45dc7a...

Note 1: The gist is in Ukrainian, and the blog post by Steve does a much better job, but hopefully you will find this useful. Feel free to replicate the results and post them.

Note 2: The absolute numbers do not necessarily imply good/bad. Both Go and BEAM focus on userspace scheduling and its fairness. Stackful coroutines have their own advantages. I think where the blog post's data is most relevant is understanding the advantages of stackless coroutines when it comes to "highly granular" concurrency - dispatching concurrent requests, fanning out to process many small items at once, etc. In any case, I did not expect sibling comments to go onto praising Node.js, is it really that surprising for event loop based concurrency? :)

Also, if you are an Elixir aficionado and were impressed by C#'s numbers - know that they translate ~1:1 to F# now that it has task CE, just sayin'.

Here's how the program looks in F#:

    open System
    open System.Threading.Tasks

    let argv = Environment.GetCommandLineArgs()

    [1..int argv[1]]
    |> Seq.map (fun _ -> Task.Delay(TimeSpan.FromSeconds 10.0))
    |> Task.WaitAll

eproxus

10 hours ago

Note that the Task library in Elixir uses supervised processes so it adds a lot more overhead. It would be interesting to see the benchmark with just normal Erlang processes.
