
You cannot rely on the "basic" string manipulation functions (strcmp, strlen, etc.) for Unicode text, because they operate on bytes and are not Unicode-aware.

However, you have the mbstring family of multibyte string functions [1], which can operate on a wide range of encodings (including UTF-8, which is the default in any sane installation nowadays).

[1] https://www.php.net/manual/en/ref.mbstring.php
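
To make the difference concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The sample word and the variable names are mine, picked only for illustration; the string is written with a code point escape so the result does not depend on how the source file is saved.

```php
<?php
// "\u{0153}" is "œ" (U+0153), which UTF-8 encodes as two bytes: 0xC5 0x93.
$word = "\u{0153}uf"; // "œuf"

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts code points

$byteSlice = substr($word, 0, 1);             // first byte only, not valid UTF-8 on its own
$charSlice = mb_substr($word, 0, 1, 'UTF-8'); // first whole character

echo $byteLength, "\n";         // 4
echo $charLength, "\n";         // 3
echo bin2hex($byteSlice), "\n"; // c5
echo $charSlice, "\n";          // œ
```

The byte-oriented substr() cuts "œ" in half, which is the kind of breakage the mb_* functions avoid.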



I think I had issues even with mbstring, for some characters like "œ". But maybe I'm wrong.


œ works fine with mb_strlen(). What might have been tripping you up is combining character sequences:

https://3v4l.org/DM4pC

Handling those "correctly" with a string length function gets complicated in any language, as there isn't a 1-to-1 mapping between Unicode codepoints and visible glyphs.


In PHP, grapheme_strlen (from the intl extension) achieves what you're describing: https://3v4l.org/HPOb3
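
For anyone who doesn't want to click through, a rough sketch of the three ways of counting, assuming both the mbstring and intl extensions are available. This is not the code behind the links above; the sample strings and array keys are my own.

```php
<?php
// The same visible character "é", written two ways.
$precomposed = "\u{00E9}";  // single code point U+00E9
$combining   = "e\u{0301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

$lengths = [];
foreach (['precomposed' => $precomposed, 'combining' => $combining] as $name => $s) {
    $lengths[$name] = [
        'bytes'       => strlen($s),             // UTF-8 bytes
        'code_points' => mb_strlen($s, 'UTF-8'), // Unicode code points
        'graphemes'   => grapheme_strlen($s),    // user-perceived characters
    ];
}

print_r($lengths);
// precomposed: bytes 2, code_points 1, graphemes 1
// combining:   bytes 3, code_points 2, graphemes 1
```

Only grapheme_strlen() reports 1 for both spellings, which matches what a reader sees on screen.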


Yes, I think you nailed what my issue was.



