
> Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists, but that most of us would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both.

Presumably the person who wrote it speaks a single language.

Just because something is not useful to them doesn't mean it is not useful in general. There are millions of polyglots, as well as documents that include words and names in multiple scripts.



I think in that case the idea would either be that you should then have an array of strings, each of which may have its own language set, or that the string should be labelled as "containing Latin and Cyrillic", but still not able to include arbitrary other characters from Unicode. And multi-lingual text still generally breaks on words... Kilobytes of Latin text with a single Cyrillic character in the middle of a word is very suspicious, in a way that kilobytes of Latin text with a single Cyrillic word isn't.

Of course you'd always need an "unrestricted" string (to speak to the rest of the system if necessary), but there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together. Those exceptions can be treated as exceptions.
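
A rough sketch of the "single Cyrillic character in the middle of a Latin word" check, assuming Python and guessing the script from the Unicode character name (the standard library has no real script property, so this is an approximation; `script_of` and `mixed_script_words` are made-up names for illustration):

```python
import unicodedata

def script_of(ch):
    """Rough script guess from the Unicode character name (an approximation)."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, whitespace
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return script
    return "OTHER"

def mixed_script_words(text):
    """Return whitespace-separated words whose letters span more than one script."""
    suspicious = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if len(scripts) > 1:
            suspicious.append(word)
    return suspicious
```

A whole Cyrillic word inside Latin text passes, since each word is checked on its own; only a word that mixes scripts internally gets flagged. Splitting on whitespace is itself an assumption that won't hold for scripts that don't separate words with spaces.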


What would that look like to an end-user? Do I need to tell my browser which scripts are contained in my emails? What would happen if I start typing in a different script?

> there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together

...in your experience. What is an almost-absurd exception to you is everyday life to others.


> Presumably the person who wrote it speaks a single language.

Presumably the person who wrote it speaks English.



