zX41ZdbW
2 hours ago
This and similar tasks can be solved efficiently with clickhouse-local [1]. Example:
ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt
I've tested it, and it is faster than both sort and this Rust code:

time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
32 sec.
time hist data.txt > /dev/null
14 sec.
time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
2.7 sec.
It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases.

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...
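To illustrate the "Swiss Army knife" point, here is a minimal sketch of joining two files in different formats and emitting a third format. The file names, column names, and sample data are assumptions for illustration; `file()`, `CSVWithNames`, `JSONEachRow`, and `FORMAT Parquet` are standard ClickHouse features.

```shell
# Hypothetical sample inputs in two different formats:
printf 'id,name\n1,alice\n2,bob\n' > users.csv
printf '{"id": 1, "total": 10}\n{"id": 2, "total": 5}\n' > orders.jsonl

# Join a CSV file with a JSON-lines file and write Parquet.
# (Guarded: only runs if the ch binary is installed.)
if command -v ch > /dev/null; then
  ch --query "
    SELECT u.id, u.name, o.total
    FROM file('users.csv', CSVWithNames) AS u
    JOIN file('orders.jsonl', JSONEachRow) AS o ON u.id = o.id
    ORDER BY u.id
    FORMAT Parquet" > joined.parquet
fi
```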
gigatexal
9 minutes ago
Exactly. I love this and DuckDB and other such amazing tools.
nasretdinov
2 hours ago
To be more fair, you could also add SETTINGS max_threads=1, though?
supermatt
an hour ago
How is that “more fair”?
nasretdinov
37 minutes ago
Well, fair in the sense that we'd compare which implementation is more efficient. Surely ClickHouse is faster, but is that because it actually uses superior algorithms, or just because it executes in parallel by default? I'd like to believe it's both, but without the "user%" time it's hard to tell.
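A minimal sketch of that comparison, assuming a file named data.txt (here generated with toy data): pinning ClickHouse to one thread via a query-level SETTINGS clause, which is standard ClickHouse syntax, so wall time reflects per-core efficiency rather than parallelism.

```shell
# Hypothetical stand-in for the benchmark data:
printf 'a\nb\na\nc\na\nb\n' > data.txt

# Single-threaded ClickHouse run: SETTINGS max_threads = 1 pins the
# query to one thread. (Guarded: only runs if ch is installed.)
if command -v ch > /dev/null; then
  time ch --input-format LineAsString \
    --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC SETTINGS max_threads = 1" \
    < data.txt
fi

# The classic pipeline is single-threaded per stage already, so its
# user time approximates single-core cost:
LC_ALL=C sort data.txt | uniq -c | sort -rn
```

With the thread counts matched, comparing user time between the two runs isolates algorithmic efficiency from parallel speedup.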
mickeyp
34 minutes ago
Last time I checked, writing efficient, contention-free and correct parallel code is hard and often harder than pulling an algorithm out of a book.