9
lorentz
1y

Unicode's biggest problem is that it isn't a streamable format. Given a prefix of a Unicode string, it's impossible to assert that the next code point won't be an accent, ZWJ, or other modifier. This means that it's impossible to convert stdin into an iterator over canonicalized Unicode graphemes without holding back the most recent one.
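
To make the hold-back concrete, here is a rough Python sketch. The function name `clusters` is made up for illustration, and the grouping rule is a deliberate simplification: it only treats code points with a non-zero canonical combining class as modifiers, so it ignores ZWJ sequences, variation selectors, and the rest of the real segmentation rules. Under those assumptions it shows that a cluster can only be emitted once the following starter (or EOF) has been read.

```python
import unicodedata
from typing import Iterable, Iterator


def clusters(codepoints: Iterable[str]) -> Iterator[str]:
    """Yield NFC-normalized clusters, always lagging one cluster behind the input.

    Simplification: a "modifier" here is any code point whose canonical
    combining class is non-zero. Real grapheme segmentation has more rules.
    """
    pending = ""
    for ch in codepoints:
        # A starter (combining class 0) proves the pending cluster is complete.
        if pending and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize("NFC", pending)
            pending = ""
        pending += ch
    # Only EOF tells us the final cluster is complete.
    if pending:
        yield unicodedata.normalize("NFC", pending)


if __name__ == "__main__":
    # "e" + COMBINING ACUTE ACCENT, then "a"
    print(list(clusters("e\u0301a")))
```

Note that the first cluster is not produced until the `a` has been consumed, which is exactly the one-element lag the post complains about.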

Comments
  • 1
    The fix would have been to make every modifier a prefix. ZWJ could have been in Polish notation. Everything except for retroactive accent keys on keyboards would work exactly as it used to. Accent keys would either have to be proactive or be programmed to signal "backspace", "replay since last starter", "new accent".
  • 1
    But the vast majority of keyboards use single-key accented letters instead of retroactive accent keys anyway.
  • 1
    Interesting. Never thought of that.
  • 0
    I guess if you think of the modifiers as characters themselves it makes sense.
  • 0
    This is a "problem" not of Unicode, but of all streams due to their very nature.

    The concept of taking out a fragment of it simply does not make sense.

    You can never know what is coming next until you read it, which is why EOF exceptions exist :p

    If you use the stream from beginning to end, it can perfectly well be a stream of graphemes. You simply can't print one until you read the next non-modifier one.

    This happens on UTF-8 streams too (which are ubiquitous). You simply can't print a glyph until you read all the bytes necessary for it, and even then, the next one could be a modifier too, so you have to apply the same logic.
  • 0
    @CoreFusionX No, it's not a shared property of all streams. There are a ton of data formats where the relevance of the next byte can be asserted based on the value of all previous ones. Unicode could have been a streamable format too; all that's needed is for modifiers to act as prefixes. You can stream JSON from a server and process all elements as soon as you receive them without waiting for the start of the next element or EOF, because everything has unambiguous terminators.
  • 0
    @lorentz

    That's simply not true.

    You can be parsing a JSON array of strings.

    You get a terminating ", and know that the current string has ended.

    Can you emit the array?

    You don't know. The next char could be a comma or a square bracket. You can neither assert what comes next nor act on your current state.

    Which is why I said it makes no sense to try and reason on *fragments* of a stream. Streams have no concept of "datagram" (to parallel the socket concept).

    You can never assert from a fragment whether you have a complete datagram or not.
  • 0
    @CoreFusionX If the array ends, you get a ]. If you get a comma you can't emit the array, but that's because there is actual information - array elements - missing. The presence or absence of missing data is unambiguous from the data you have. If you have a stream of Unicode graphemes, you always have to hold back the last one in case the next codepoint is a modifier.
  • 0
    In a streamable format "end of element" and "end of stream" are distinct unambiguous signals that are always either deducible or explicitly stated. Unicode does not have an "end of element" signal.
  • 0
    @lorentz

    Yes, I get that. What I'm saying is that the meaning of end of datagram can be explicit or implicit.

    In your case, the Unicode grammar dictates that your end of datagram is EOF or getting a new non-modifier, and that's how stuff like terminals has to do it anyway.
  • 0
    @CoreFusionX I can assure you that a terminal doesn't wait until the next character or EOF arrives to print the current grapheme, otherwise you couldn't print a string without newline, or the last letter would be missing. They redraw the last grapheme if the next codepoint is an accent or other modifier supported by the font, or maybe just expect that buffers contain whole datagrams, which is a flawed assumption but it works well enough in an anglocentric environment such as the terminal. This is the only way a streaming Unicode recipient can work because, again, Unicode doesn't allow you to conclude that the last grapheme you received is complete other than by ending the stream.
  • 0
    :s/whole datagrams/whole graphemes/
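
For comparison, the prefix-modifier idea from the first comment can be sketched the same way. This is a purely hypothetical encoding, not anything Unicode defines: combining marks are assumed to arrive before their base character, and `prefix_clusters` is an invented name. As in any such sketch, only code points with a non-zero combining class count as modifiers.

```python
import unicodedata
from typing import Iterable, Iterator


def prefix_clusters(codepoints: Iterable[str]) -> Iterator[str]:
    """Decode a HYPOTHETICAL stream in which modifiers precede their base.

    The base character doubles as the "end of element" signal, so each
    cluster is emitted the moment its base arrives, with no lookahead.
    """
    marks = ""
    for ch in codepoints:
        if unicodedata.combining(ch) != 0:
            marks += ch  # collect prefixes until the base shows up
        else:
            # Reorder into real Unicode order (base first) before normalizing.
            yield unicodedata.normalize("NFC", ch + marks)
            marks = ""
    if marks:
        # In this made-up scheme, prefixes without a base mean truncated input.
        raise ValueError("stream ended in the middle of a cluster")


if __name__ == "__main__":
    # COMBINING ACUTE ACCENT, then "e", then "b"
    print(list(prefix_clusters("\u0301eb")))
```

Here the accented letter is available as soon as the `e` is read, and a stream that ends on a dangling prefix is detectably incomplete, which is the distinction between "end of element" and "end of stream" argued for above.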