Does any programming language get Unicode right all the way? I thought Python did it mostly correctly, but with combining characters, for example, I would argue it gets it wrong if you try to reverse a Unicode string.
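For instance, a minimal Python 3 sketch of that problem (the accent here is U+0301 COMBINING ACUTE ACCENT):

    s = "e\u0301"       # "é" written as a base letter plus a combining accent: two code points
    print(s[::-1])      # naive reversal flips the code points, so the accent now
                        # precedes the "e" and no longer combines with it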
My basic litmus test for "does this language support Unicode" is, "does iterating over a string get me code points?"¹
Rust, and recent versions of Python 3 (but not early versions of Python 3, and definitely not 2…) pass this test.
I believe that JavaScript, Java, C#, C, C++ … all fail.
(Frankly, I'm not sure anything in that list even has built-in functionality in the standard library for doing code-point iteration. You have to more or less write it yourself. I think C# comes the closest, by having some Unicode utility functions that make the job easier, but still doesn't directly let you do it.)
¹Code units are almost always, in my experience, the wrong layer to work at. One might argue that code points are still too low-level, but this is a basic litmus test (I don't disagree that code points are often wrong; it's mostly a matter of what I can actually get from a language).
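For what it's worth, here's the test applied to Python 3, one of the languages claimed above to pass (a quick sketch):

    s = "a\u00e9\U0001F600"      # 'a', 'é', and U+1F600 (outside the BMP)
    for cp in s:
        print(hex(ord(cp)))      # 0x61, 0xe9, 0x1f600 -- one item per code point,
                                 # even for the character above U+FFFF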
> try to reverse a Unicode string.
A good example of where even code points don't suffice.
Lua 5.3 can iterate over a UTF-8 string. You can even index character positions (not byte positions) in a UTF-8 string. Some more information here: https://www.lua.org/manual/5.3/manual.html#6.5
I basically agree with you, but note that code points are not the same as characters or glyphs. Iterating over code points is a code smell to me. There is probably a library function that does what you actually want.
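As one concrete example of such a library function (a sketch relying on the third-party regex module, which supports \X for extended grapheme clusters):

    import regex                      # third-party (pip install regex); not in the stdlib

    s = "e\u0301"                     # one user-perceived character, two code points
    print(list(s))                    # two entries -- iterating by code point splits the character
    print(regex.findall(r"\X", s))    # one entry -- the full grapheme cluster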
I explicitly mention exactly this in my comment, and provide an example of where it breaks down. The point, which I also heavily noted in the post, is that it's a litmus test. If a language can't pass the iterate-over-code-points bar, do you really think it would give you access to characters or glyphs?
However, it exposes the encoding directly as a sequence of 16-bit ints. In other words, if you iterate over a string or index it, you're getting those, and not code points (i.e. it doesn't account for surrogate pairs).
Note that this only applies to iteration and indexing. All string functions do understand surrogates properly.
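To make the code-unit/code-point distinction concrete, here is a small Python 3 sketch (Python strings are code points, so the UTF-16 units have to be produced explicitly):

    import struct

    s = "\U0001D11E"                  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(s))                     # 1 -- one code point
    units = struct.unpack("<2H", s.encode("utf-16-le"))
    print([hex(u) for u in units])    # ['0xd834', '0xdd1e'] -- two 16-bit code units
                                      # (a surrogate pair), which is what per-unit
                                      # indexing and iteration hand back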
I'd accept that, too. I'm not familiar w/ Swift, so it wasn't in the list above. (But I do think the default should not be code units, or programmers will use it incorrectly out of ignorance. Forcing them to choose prevents that; hence, I'll allow it.)
I would go one step further and say that there is no meaningful default for these things. It all depends on the context, and there's no single context so common that it's the only one most people ever see. Thus, it should always be explicit, and an attempt to enumerate or index a string directly should not be allowed - you should always have to spell out whether it's the underlying encoding units, or code points, or text elements, or something else.
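A rough sketch of what that could look like (hypothetical names, using Python, with the third-party regex module for the grapheme view):

    import regex  # third-party; provides \X for extended grapheme clusters

    class Text:
        """Hypothetical string wrapper: no default iteration, callers must pick a view."""

        def __init__(self, s: str) -> None:
            self._s = s

        def utf8_units(self) -> bytes:       # the underlying encoding units
            return self._s.encode("utf-8")

        def code_points(self):               # Unicode code points
            return iter(self._s)

        def graphemes(self):                 # user-perceived characters (approximately)
            return iter(regex.findall(r"\X", self._s))

With something like this, plain `for x in text` simply isn't available, so the choice of layer always appears in the calling code.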
How do Clojure and ClojureScript do, given that they are built on top of the JVM/CLR or JavaScript? I'm assuming that since they fall back to many of the primitives of their respective runtimes, they use those runtimes' implementations.
I've had the least trouble when using Apple's Objective-C (NSString) and Microsoft's C# - these two at least make you take conscious decisions when transcoding to bytes.