ks2048
3 days ago
That's only the tip of the iceberg of hyphen-looking characters.
Here's some more,
2010 ; 002D ; MA #* ( ‐ → - ) HYPHEN → HYPHEN-MINUS #
2011 ; 002D ; MA #* ( ‑ → - ) NON-BREAKING HYPHEN → HYPHEN-MINUS #
2012 ; 002D ; MA #* ( ‒ → - ) FIGURE DASH → HYPHEN-MINUS #
2013 ; 002D ; MA #* ( – → - ) EN DASH → HYPHEN-MINUS #
FE58 ; 002D ; MA #* ( ﹘ → - ) SMALL EM DASH → HYPHEN-MINUS #
06D4 ; 002D ; MA #* ( ۔ → - ) ARABIC FULL STOP → HYPHEN-MINUS # →‐→
2043 ; 002D ; MA #* ( ⁃ → - ) HYPHEN BULLET → HYPHEN-MINUS # →‐→
02D7 ; 002D ; MA #* ( ˗ → - ) MODIFIER LETTER MINUS SIGN → HYPHEN-MINUS #
2212 ; 002D ; MA #* ( − → - ) MINUS SIGN → HYPHEN-MINUS #
2796 ; 002D ; MA #* ( ➖ → - ) HEAVY MINUS SIGN → HYPHEN-MINUS # →−→
2CBA ; 002D ; MA # ( Ⲻ → - ) COPTIC CAPITAL LETTER DIALECT-P NI → HYPHEN-MINUS # →‒→
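For illustration, the excerpt above could be turned into a lookup table along these lines. This is only a sketch: the names `HYPHEN_LOOKALIKES` and `fold_hyphens` are made up here, the table covers just the eleven code points listed above rather than the full confusables file, and blindly folding something like ARABIC FULL STOP would be wrong for real Arabic text.

```python
# Code points taken from the confusables excerpt quoted above (not the full file).
HYPHEN_LOOKALIKES = [
    0x2010,  # HYPHEN
    0x2011,  # NON-BREAKING HYPHEN
    0x2012,  # FIGURE DASH
    0x2013,  # EN DASH
    0xFE58,  # SMALL EM DASH
    0x06D4,  # ARABIC FULL STOP
    0x2043,  # HYPHEN BULLET
    0x02D7,  # MODIFIER LETTER MINUS SIGN
    0x2212,  # MINUS SIGN
    0x2796,  # HEAVY MINUS SIGN
    0x2CBA,  # COPTIC CAPITAL LETTER DIALECT-P NI
]

# str.translate accepts a dict mapping code points to replacement strings.
TO_HYPHEN_MINUS = {cp: "-" for cp in HYPHEN_LOOKALIKES}


def fold_hyphens(text: str) -> str:
    """Replace the listed look-alikes with ASCII HYPHEN-MINUS (U+002D)."""
    return text.translate(TO_HYPHEN_MINUS)
```

Characters not in the table, including the plain ASCII hyphen, pass through unchanged.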
copied from https://www.unicode.org/Public/security/8.0.0/confusables.tx...
renhanxue
3 days ago
Three Minus Signs for the Mathematicians under the pi,
2212 MINUS SIGN
2796 HEAVY MINUS SIGN
02D7 MODIFIER LETTER MINUS SIGN
Seven Dashes for the Dash-lords in their quotes as shown,
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
2015 QUOTATION DASH
2E3A TWO-EM DASH
2E3B THREE-EM DASH
FE58 SMALL EM DASH
Nine Hyphens for Word Breakers, one of them ­,
00AD SOFT HYPHEN
058A ARMENIAN HYPHEN
1400 CANADIAN SYLLABICS HYPHEN
1806 MONGOLIAN TODO SOFT HYPHEN
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2E17 DOUBLE OBLIQUE HYPHEN
2E40 DOUBLE HYPHEN
30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
One for the Dark Word in the QWERTY zone
In the land of ASCII where Basic Latin lie.
One String to rule them all, One String to find them,
One String to bring them all and in the plain-text, bind them
In the land of ASCII where Basic Latin lie.
002D HYPHEN-MINUS
- @FakeUnicode on Twitter, with apologies to J. R. R. Tolkien
markus_zhang
3 days ago
I think it's a good idea to write a plugin for any IDE to highlight those confusing characters.
MrJohz
3 days ago
I know vscode had this feature built in, and it's come in handy a couple of times for me.
samatman
3 days ago
VSCode does this out of the box actually. Ended up putting a few on a whitelist while writing Julia, where it can get kind of ugly (puts a yellow box around them).
userbinator
3 days ago
Using an ASCII-only font automatically shows all characters that IMHO should not be present in source code.
makeitdouble
3 days ago
A note on non-ascii in code: I thought of it as an abomination, until hitting test pattern descriptors.
On a project targeted at non-English-speaking devs with a strong domain knowledge requirement, writing the test patterns (endless arrays of input -> expected output sequences, interspersed with adjustment code) in the native language saves an incredible amount of time and effort, in particular as we don't need to translate obscure notions into even more obscure English.
And that had very few downsides, as it's not production running code, lining will still raise anything problematic, and the whole thing is easier to get reviewed by non domain experts.
We could have made a translation layer to have the content in a spreadsheet and convert it to test code, but that's not any more stable than having unicode names straight into the code.
nine_k
2 days ago
String constants / symbols are one domain; keywords and reserved characters, another. They should be checked for different things. E.g. spell-checking string constants as plain text if they look like plain text is helpful. Checking for non-ASCII quotes / dashes / other punctuation outside quoted strings, where they can only occur by mistake, is also helpful.
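A rough sketch of that second check, flagging non-ASCII characters only when they sit outside a quoted string. The function name `non_ascii_outside_strings` is hypothetical, and the scanner is deliberately simplistic: it only understands single-line '...' and "..." literals with backslash escapes, and it knows nothing about comments or multi-line strings. A real linter would use the language's own tokenizer.

```python
def non_ascii_outside_strings(line: str):
    """Return (column, char) pairs for non-ASCII characters outside quoted strings.

    Simplification: only '...' and "..." literals with backslash escapes are
    recognized; comments and multi-line strings are not handled.
    """
    hits = []
    quote = None     # the quote character that opened the current literal, if any
    escaped = False  # True right after a backslash inside a literal
    for col, ch in enumerate(line):
        if quote is not None:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None  # closing quote ends the literal
            continue  # anything inside a literal is allowed
        if ch in ("'", '"'):
            quote = ch
        elif ord(ch) > 127:
            hits.append((col, ch))
    return hits
```

With this, an accented letter inside a string literal is left alone, while an en dash used as an operator on the same line is reported with its column.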
makeitdouble
2 days ago
My comment got mistakenly autocorrected (meant "linting" instead of "lining"), which is so on point given the subject.
I agree, and think a decent linter can deal with these issues, and syntax highlighting as well.
In particular these kinds of rules tend to get complicated with many exceptions (down to specific folders needing dedicated rules), so doing it as lint and not at the language level gives a lot of freedom on where and how to apply the rules and raise warnings.
keybored
3 days ago
For every such Unicode problem (which is a data input^W source problem, not a programming source code error) there are fifty problems caused by the anemic ASCII character set like Unix toothpicks and three layers of escaping due to using too uniform delimiters.
(Granted this is heavily biased since so much source code is ASCII-only so you don’t get many Unicode problems in the first place...)
PaulHoule
2 days ago
It's a very unpopular opinion but I use as much Unicode as I can in source code. In comments for instance I can write
x²
as well as italic and bold characters (would have demoed but HN filters out Unicode bold & italics) and I can write a test named processes中文Characters()
and also write Java that looks like APL, add sigil characters in code generated stubs that will never conflict with other people's code because they're too afraid to use these characters, etc.
https://github.com/paulhoule/ferocity/blob/main/ferocity-std...
People will ask "how do you enter those characters?" and I say "I don't know but I can cut and paste them, they get offered by the autocomplete, etc."
1-more
a day ago
I had a beautiful vision when programming my keyboard. The style at the time was to write a massive array in C with the keycodes for the various layers. I put commented out box drawing characters between the lines to delineate where the keys are. I wanted to use the C Preprocessor to #define the thin vertical box drawing character as a comma, but somehow that was out of the range of acceptable characters. If I had that, then my source would be 1% more readable to me, the only person who's ever going to use it.
https://github.com/qmk/qmk_firmware/compare/master...perkee:...
I still use tons of box drawing characters in comments. I'm actually writing a little doodad to let me edit them fluidly then copy them into my block comments, because a truth table is easy to read!
Your comment also reminds me of the introduction of type parameters/generics in Go via the Canadian Aboriginal syllabary for "po" and "pa" ("ᐸ" and "ᐳ") https://github.com/vasilevp/aboriginal
Arnt
2 days ago
Hardly unpopular where I live. Lots of source code contains € and much else. Grepping for it in the code I worked on last week, I find non-ASCII characters in dozens of tests, in some scripts that seem to be part of CI, in a comment about a locale-specific bug, and I stopped looking there.
How to enter them? Well, the keys are on the keyboard.
PaulHoule
2 days ago
If you're in Euro land.
I have a lot of personal interest in Chinese language content these days, I have no idea how to set up and use an "input method" but I either see the text I want in front of me or ask an LLM "How do I write X in Chinese?" and either way cut and paste.
sigseg1v
2 days ago
Chinese speakers enter words using the same type of keyboard you would use in North America. The characters are entered as "pinyin", which is a romanized phonetic method of describing Chinese words. You should be able to enter it on Windows, for example, by enabling Simplified Chinese / pinyin in the language input settings.
Arnt
2 days ago
That's pretty much "type an ASCII representation of a reasonable pronunciation, then pick the right character from the drop-down menu". Details vary but that's the gist.
powersnail
3 days ago
That would make it impossible to edit non-ascii strings, like texts in foreign languages. As far as I know, most editors/IDE don't support switching fonts for string literals. It is more feasible for a syntax highlighter to highlight non-ascii characters outside of literals.
Someone
3 days ago
> As far as I know, most editors/IDE don't support switching fonts for string literals
When asked to render a Unicode character that isn’t present in the font, modern OSes will automatically pick a font that has it.
https://en.wikipedia.org/wiki/Fallback_font: “A fallback font is a reserve typeface containing symbols for as many Unicode characters as possible. When a display system encounters a character that is not part of the repertoire of any of the other available fonts, a symbol from a fallback font is used instead. Typically, a fallback font will contain symbols representative of the various types of Unicode characters.”
That can be avoided, for example by storing text as “one character per byte”, but I don’t think many editors do that nowadays.
powersnail
2 days ago
But that would not distinguish between chars inside a string literal and chars outside of a string literal.
lifthrasiir
3 days ago
String literals frequently have non-ASCII characters to say the least.
oneeyedpigeon
2 days ago
It depends on whether you count html as "source code", but if so, then non-ASCII characters absolutely should be present!
metadat
3 days ago
Some platforms, such as Python 3, have full UTF-8 support already, so what is the problem?
userbinator
3 days ago
The one shown very clearly by this article.
keybored
3 days ago
The wrong values are from PDF files. Maybe you mean using a system-wide ASCII-only font but you finished your point with “should not be present in source code”. Source code wasn’t the problem here.
foobarchu
2 days ago
It very much is a problem in source code too though. It's unfortunately common in college courses (particularly non-CS courses with programming like bioinformatics) for instructors to distribute sample code as word docs. Cue students who can't run the code and don't know why because Word helpfully converted all double quotes to a "prettier" Unicode equivalent.
keybored
2 days ago
Bizarrely I have experienced the same thing from LaTeX with its purpose-made code/literal blocks.
But the most shocking thing is printed learning resources on things like Haskell where the code examples on purpose are some kind of typographic printout rather than just the symbols themselves!
metadat
3 days ago
Thanks userbinator.. guilty grimace smile
Maybe highlighting isn't such a bad idea :)
mjevans
3 days ago
Also remember to squash 'wide' characters back to the ASCII table where possible, if the data is being processed by normal tools.
There are honestly so many data-cleaning steps a pipeline could need / have to produce programmatically well-formatted data.
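One way to do the 'wide' squashing, as a sketch: Unicode compatibility normalization (NFKC) folds fullwidth forms back to their ASCII counterparts. The helper name `squash_wide` is made up here. Note that NFKC is not dash-specific: it also rewrites ligatures, superscripts and other compatibility characters, and it does not touch look-alikes such as MINUS SIGN that have no compatibility mapping.

```python
import unicodedata


def squash_wide(text: str) -> str:
    """Apply NFKC normalization, which maps fullwidth forms to ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth letters, digit and hyphen-minus fold to plain ASCII.
print(squash_wide("\uff21\uff22\uff23\uff0d\uff11"))  # prints "ABC-1"
```

Whether the side effects of NFKC are acceptable depends on the data, so this would be one step among the several cleaning steps mentioned above, not a replacement for them.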
tracker1
2 days ago
Yeah, quotes and magic quotes are another set... Nothing like discovering MySQL treats magic quotes as ANSI quotes for purposes of SQL (injection)... addslashes wasn't enough.
For what it's worth TFA could still use a regexp, it would just be slightly more complex. But the conditional statement may or may not be faster or easier to reason with.
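As a sketch of what such a regexp might look like: TFA's actual pattern isn't reproduced in this thread, so the `RANGE` pattern below (two numbers separated by some kind of dash) is purely hypothetical, and the character class only covers a handful of the dash-like code points mentioned elsewhere in the discussion.

```python
import re

# ASCII hyphen-minus first (so it is literal inside the class), then
# HYPHEN, NON-BREAKING HYPHEN, FIGURE DASH, EN DASH, EM DASH, MINUS SIGN.
DASH_CLASS = "[-\u2010\u2011\u2012\u2013\u2014\u2212]"

# Hypothetical pattern: "<number> <dash> <number>" with optional whitespace.
RANGE = re.compile(r"(\d+)\s*" + DASH_CLASS + r"\s*(\d+)")


def parse_range(text: str):
    """Return (start, end) as ints if text looks like a dash-separated range, else None."""
    m = RANGE.fullmatch(text.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

The class has to be extended by hand for every new look-alike that turns up, which is probably part of why the conditional approach can be easier to reason about.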
toastal
3 days ago
And yet all of these serve a different, useful purpose for semantics.
account42
2 days ago
As TFA shows, no they don't. They may have been intended for different semantics, but once humans come into play, if it looks vaguely correct then it's getting used.