senkora
3 days ago
Hangul is great for computer entry, but the data representation is a little tricky, because each syllable is treated as a single glyph and there are many syllables.
I found this old comment that explains it better than I can: https://news.ycombinator.com/item?id=28287811
kijin
3 days ago
The data representation is fairly straightforward once you're familiar with the composition rules, at least for modern Korean.
Unicode simply lists all possible combinations in dictionary order starting from U+AC00. So you can take any code point and split out the 초성, 중성 and 종성 using simple arithmetic, just like you can figure out Latin letters from their ASCII codes.
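For what it's worth, here is a minimal Python sketch of that arithmetic. The counts (19 initials, 21 medials, 28 finals including "no final") follow the layout of the Hangul Syllables block; the names `decompose` and `compose` are made up for illustration and don't come from any library.

```python
# Layout of the Hangul Syllables block (modern Korean only).
S_BASE = 0xAC00                            # first precomposed syllable, 가
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7  # conjoining jamo bases
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28     # initials, medials, finals (+ "none")
S_COUNT = L_COUNT * V_COUNT * T_COUNT      # 11172 syllables in total


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed syllable into (초성, 중성, 종성) conjoining jamo.

    The final is an empty string when the syllable has no 종성.
    """
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    final = chr(T_BASE + t) if t else ""   # t == 0 means "no final consonant"
    return chr(L_BASE + l), chr(V_BASE + v), final


def compose(l: int, v: int, t: int = 0) -> str:
    """Inverse operation: build a syllable from zero-based jamo indices."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)


if __name__ == "__main__":
    for ch in "한글":
        parts = decompose(ch)
        print(ch, [f"U+{ord(p):04X}" for p in parts if p])
```

The results are code points from the Hangul Jamo block (U+1100 onward), not the standalone letters you'd type on a keyboard, so they may render differently depending on the font.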
hyeonwho4
3 days ago
초성 = initial sound (consonant); 중성 = middle sound (vowel); 종성 = final sound (consonant).
My understanding is that there are two possible Unicode encodings of Korean: one goes sound by sound (macOS) and the other syllable by syllable (Windows). This is why Korean UTF-8 filenames from macOS appear broken on modern Windows machines.
kijin
3 days ago
Yeah, it's stupid that Windows can't normalize the two completely valid ways of expressing Hangul in Unicode. If they can process e + acute accent = é, they should be able to do ㄱ + ㅏ = 가.
Having said that, macOS also made the strange choice of expressing Hangul using the Hangul Jamo (by sound) Unicode block even when there are equivalent precomposed symbols in the Hangul Syllables block. Encoding each sound individually takes up 2-3 times as much storage, just like with accented characters in Latin. Besides, if you just list sounds and rely on them to be combined automatically, what do you do when you legitimately want to write a sequence of uncombined sounds, like ㄱㅏㅁ instead of 감?
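To make the two forms concrete, here is a rough sketch using Python's standard `unicodedata` module. It assumes the decomposed form behaves like plain NFD (what a given filesystem actually stores may differ in details), and the helper `utf8_len` is just a name I made up. It also shows that standalone letters from the Hangul Compatibility Jamo block (U+3131 and friends) are left alone by NFC, which is one possible way to write an uncombined sequence.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes the text takes up when encoded as UTF-8."""
    return len(text.encode("utf-8"))


precomposed = "\uAC10"                                   # 감 as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # U+1100 U+1161 U+11B7
recomposed = unicodedata.normalize("NFC", decomposed)    # back to U+AC10

# Same text to a human, different code points to a naive string comparison.
print(precomposed == decomposed)    # False
print(precomposed == recomposed)    # True

# Each jamo costs 3 bytes in UTF-8, the same as a whole precomposed syllable.
print(utf8_len(precomposed), utf8_len(decomposed))       # 3 9

# Compatibility jamo (ㄱㅏㅁ typed as separate letters) do not compose under NFC.
standalone = "\u3131\u314F\u3141"
print(unicodedata.normalize("NFC", standalone) == standalone)  # True
```

So comparing filenames after normalizing both sides to the same form (NFC or NFD) should be enough to treat the two encodings as equal.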
samatman
3 days ago
Rare WalterBright L taken in that thread.
Sure, Unicode isn't the Platonic ideal of a character encoding. It has warts, legacy features, and... and it is a universal encoding of all human writing. What an exceptional and incredible accomplishment.
Could you replace it with something better designed?
No. No, you cannot. You can in principle design something better, but that's a completely different, quixotic, and useless task.
It's also far from impossible to implement Unicode 'correctly'; folks not only can, but do, routinely. It's extensively documented, there's example code, it's just work.
Also, if your game plan for Unicode-D includes removing the most beloved and consistently demanded feature, emoji: then no, that person in particular is not capable even in principle of designing something better. That game has been lost before it began.
zzo38computer
2 days ago
> and it is a universal encoding of all human writing
It isn't (and never can be).
> Could you replace it with something better designed? No. No, you cannot. You can in principle design something better,
Something that some people fail to consider is that one character set is not suitable for all purposes. Unicode is not very good for most purposes though. I think Extended TRON Code has many advantages, although trying to use Extended TRON Code (or some other alternative) for everything would be almost as bad as using Unicode for everything, but in different ways.
> Also, if your game plan for Unicode-D includes removing the most beloved and consistently demanded feature, emoji
I think that colourful emoji should not belong in the character set for text. I also do not want colourful emoji on my computer.
lgessler
3 days ago
This has led to work showing that models can do better sometimes if you decompose these into their constituent characters, e.g.: https://aclanthology.org/2022.emnlp-main.472.pdf
bobthepanda
3 days ago
A paper on Korean where the main acronym is BTS has got to be intentional, right?
lgessler
3 days ago
> We hope that our BTS will light the way up like dynamite[9] for future research on Korean NLP.