Soft hyphen (SHY) – a hard problem? (1997–2024)

3 points, posted 12 hours ago
by keybored

2 Comments

rikroots

11 hours ago

When I was building out a new text layout engine for my canvas library, handling soft hyphens was one of the most annoying parts of the work ... until I moved beyond western fonts and met the line layout issues that other languages impose on their scripts. Not helped, of course, by my severe lack of knowledge (beyond Google search) about how those scripts work.

For instance, CJK languages require punctuation marks to stay with the preceding character so they never start a line. I managed to implement some functionality to recognise the word joiner character (U+2060), which can be placed between the punctuation and its preceding character, but only got it working for text laid out in horizontal lines. Things currently break down when the text is arranged in columns - I think there's an extra requirement in Japanese for the punctuation mark to also swivel 90 degrees relative to its preceding character? I've not yet recovered sufficient will or strength to investigate further and fix it.
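To make the word joiner idea concrete, here is a minimal sketch in Python (not the canvas library's actual code). It assumes, as a rough stand-in for CJK text, that every boundary between two characters is a possible break unless a word joiner sits on either side of it. The function name `break_opportunities` is made up for illustration.

```python
WORD_JOINER = "\u2060"  # WORD JOINER: forbids a line break on either side of it

def break_opportunities(text: str) -> list[int]:
    """Return the indices i where a line may be broken just before text[i].

    Simplifying assumption: any boundary between two characters is breakable,
    except a boundary that touches a word joiner.
    """
    allowed = []
    for i in range(1, len(text)):
        if text[i] == WORD_JOINER or text[i - 1] == WORD_JOINER:
            continue  # the joiner glues its two neighbours together
        allowed.append(i)
    return allowed

# Indices: 0 日, 1 本, 2 語, 3 word joiner, 4 。, 5 次
sample = "日本語" + WORD_JOINER + "。次"
print(break_opportunities(sample))  # [1, 2, 5]
```

No break is offered at index 3 or 4, so the full stop can never start a line, while the break after it (index 5) stays available.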

As for Thai ... why does that culture not like adding spaces between written words? There is a zero-width space character (U+200B), which I've added into my layout engine's line layout calculations, but the dev/user has to add those zero-width spaces into the text themselves. Compare that to modern browsers, which seem to include functionality to automatically parse Thai text to correctly break the stream of glyphs into lines - but there's no way for me to access that functionality so I can replicate it in the canvas. Interestingly, the Thai language has its own dedicated W3C standard[1] so I expect there are lots of other Thai-related layout issues that I'm missing/ignoring.
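A rough sketch of the manual zero-width space approach, again in Python and under loud assumptions: the text already contains U+200B at every word boundary, and character count stands in for measured glyph width (a real engine would measure the rendered text instead). The name `wrap_at_zwsp` is hypothetical.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible, marks a permitted line break

def wrap_at_zwsp(text: str, max_chars: int) -> list[str]:
    """Greedily pack ZWSP-delimited segments into lines of at most max_chars.

    Assumption: len() approximates width. A single segment longer than
    max_chars is not split further; it simply gets a line of its own.
    """
    lines, current = [], ""
    for segment in text.split(ZWSP):
        if current and len(current) + len(segment) > max_chars:
            lines.append(current)   # segment does not fit: close the line
            current = segment
        else:
            current += segment      # segment fits: no visible separator added
    if current:
        lines.append(current)
    return lines

thai = "ภาษา" + ZWSP + "ไทย" + ZWSP + "สวย"
print(wrap_at_zwsp(thai, 7))  # ['ภาษาไทย', 'สวย']
```

The zero-width spaces are dropped from the output because they only carry break information, which matches how they behave visually.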

[1] - https://www.w3.org/International/sealreq/thai/

keybored

9 hours ago

I think I agree with the author. At least, this is what I would use for the semantics:

- A soft hyphen is meant to represent a line-break opportunity

- When transforming text for layout, the soft hyphen can be removed or inserted depending on where it would appear

- It is ignored when searching text

- It is similar to single linebreaks in Markdown: they will be stripped in some output formats

- Language-specific concerns (see Swedish) are not considered

The list above is incomplete.
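A small Python sketch of those semantics, as I read them. The helper names are made up, and language-specific behaviour is deliberately left out, as in the list.

```python
SHY = "\u00ad"  # SOFT HYPHEN: an invisible line-break opportunity

def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen, e.g. for output formats that do no wrapping."""
    return text.replace(SHY, "")

def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores soft hyphens on both sides."""
    return strip_soft_hyphens(needle) in strip_soft_hyphens(haystack)

def render_line(segment: str, breaks_here: bool) -> str:
    """Render one laid-out line.

    Unused soft hyphens vanish. A soft hyphen at the very end of a line that
    is actually broken there is shown as a visible hyphen.
    """
    visible_hyphen = breaks_here and segment.endswith(SHY)
    return strip_soft_hyphens(segment) + ("-" if visible_hyphen else "")

word = "hy" + SHY + "phen" + SHY + "ation"
print(contains(word, "hyphenation"))                 # True
print(render_line("hy" + SHY + "phen" + SHY, True))  # hyphen-
print(render_line(word, False))                      # hyphenation
```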

I feel that all the links are too abstract. Here is a concrete example:

Consider a program that inserts linebreaks into a paragraph. It can put a soft hyphen at the end of a line to signal: if you rerun this program, you may remove this hyphen because it’s not part of the word.

Now you have a character which covers one of the following:

- Hyphen that is a linebreak opportunity

- Hyphen that is not a linebreak opportunity (e.g. in “word- and paragraph-boundaries”: “word-” should not be a linebreak opportunity because you don’t want to put that back as “word-and”)

- Hyphen at the end of the line which was not in the original word (soft hyphen)
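Here is a sketch of the "rerun" half of such a program in Python. It assumes a hypothetical wrapping step that ended a line with U+00AD whenever it split a word, and otherwise only turned a space into a newline. Under that assumption the unwrap step never has to guess whether a line-final hyphen belongs to the word.

```python
SHY = "\u00ad"  # SOFT HYPHEN, written by the (hypothetical) wrapping program

def unwrap(wrapped: str) -> str:
    """Undo earlier line breaking so the paragraph can be wrapped again.

    Assumes a line ends in SHY only where a word was split, and that every
    other newline replaced a single space.
    """
    rejoined = wrapped.replace(SHY + "\n", "")  # split word: glue halves back
    return rejoined.replace("\n", " ")          # ordinary break: restore space

wrapped = "word- and para" + SHY + "\ngraph-boundaries\nare hard"
print(unwrap(wrapped))       # word- and paragraph-boundaries are hard
print(unwrap("word-\nand"))  # word- and
```

Note the limit of this sketch: if the wrapping step had broken right after the real hyphen in "paragraph-boundaries", unwrapping would yield "paragraph- boundaries". That first case in the list would need its own marker or rule.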