It seems like you’d need to read at least a few dozen languages fluently to be a...

raphlinus · on Sept 29, 2019

I'm not fluent in any language (except maybe English), but this was my job on Android for about five years, and I still have my hand in. I got very good at being able to spot various forms of incorrect rendering, and of course called in the experts when needed.

One of my favorite examples was Devanagari shaping. In the word ट्विटर (tvitar = Twitter), does the "i" matra shape before the "tv" or in the middle of it? In my sample, people accepted it either way (and you'll see both depending on font), but there are lots of examples where one is just wrong.
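To make the question concrete: the stored text is the same whichever way the font draws it. A minimal Python sketch, using only the standard `unicodedata` module, that lists the code points of the word in storage order. The word is spelled out as escapes so the order is unambiguous; the `describe` helper is just an illustrative name.

```python
import unicodedata

# "Twitter" in Devanagari, written as escapes in logical (storage) order.
WORD = "\u091f\u094d\u0935\u093f\u091f\u0930"

def describe(text):
    """Return (code point, Unicode name) pairs in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(WORD):
    print(code_point, name)
```

The vowel sign I is the fourth code point, after TTA, VIRAMA, and VA. Whether it gets drawn before the whole "tv" cluster or in the middle of it is decided by the shaper and the font, not by the text.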

rk06 · on Oct 1, 2019

As "Twitter" is a foreign word, there is no set standard for it.

If I were to write it, I would write "tv" first, and then draw an extended variant of "i" to cover "tv".

In text rendering, this translates to a custom ligature for an unexpected combo. I am not sure how feasible it is to add ligatures for every such combo.
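A toy sketch of why that gets expensive: every combo needs its own entry. This is not how any real font or shaper stores its rules; the table, the glyph names, and `apply_ligatures` are all hypothetical, and the entries are Latin only to keep the example readable.

```python
# Hypothetical glyph names; each ligature needs its own table entry.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("a", "e"): "ae",
}

def apply_ligatures(glyphs, table=LIGATURES):
    """Greedy left-to-right substitution, preferring the longest match."""
    longest = max((len(key) for key in table), default=1)
    out = []
    i = 0
    while i < len(glyphs):
        # Try the longest candidate sequence first, down to length 2.
        for size in range(min(longest, len(glyphs) - i), 1, -1):
            key = tuple(glyphs[i:i + size])
            if key in table:
                out.append(table[key])
                i += size
                break
        else:
            # No ligature starts here; keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(list("office")))
```

Any sequence that is not in the table falls through unchanged, which is the "unexpected combo" case.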

In real life, as long as the reader "gets" it, they would assume that it is the right way to write it :)

pcwalton · on Sept 29, 2019

I've been doing text rendering professionally for a few years now and I wouldn't consider myself fluent in anything other than English and maybe German (I'm a bit rusty in the latter). I can also speak some Japanese, but, other than that, that's it. The rest of my knowledge is a smattering of weird details about various languages that are useful for writing text renderers but terribly unhelpful for actually understanding the languages.

Gankro · on Sept 29, 2019

Yeah there's a reason all my examples use the same ~3 languages, and only a few random fragments from them. You just need a few examples that demonstrate that something can happen. Everything actually works pretty uniformly, so as long as you have a few examples that cover the interesting cases and handle those cases in a general way, everything tends to work fine. The actual font formats are much more nasty and corner-casey. (Thank god you don't need to deal with that, eh Patrick?)

Same reason I prefer to use imperfect terms that capture the important aspects of the problem-space from an English-speaking perspective. Are ligatures the right word for how Arabic and Marathi get shaped into glyphs? Maybe not, but as long as you get that the æ ligature can be synthesized from ae by a font, and that this is super important for some languages, you're on the right path.

I don't even know what the fragments I use mean, lol. I like to assume I'm just copy-pasting Arabic swears around. Apparently at least one is just Manish's name?

jfk13 · on Sept 29, 2019

> Apparently at least one is just Manish's name?

Yes, I noticed that. :) Both मनीष and منش are "Manish", though he didn't bother with the vowels in the Arabic-script version, so alternative readings are possible.
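You can see the missing vowels without reading either script. A small Python sketch with the standard `unicodedata` module; both strings are written as escapes, and `marks` is just an illustrative helper that keeps code points whose general category is a mark.

```python
import unicodedata

DEVANAGARI = "\u092e\u0928\u0940\u0937"  # MA, NA, VOWEL SIGN II, SSA
ARABIC = "\u0645\u0646\u0634"            # MEEM, NOON, SHEEN

def marks(text):
    """Return the names of code points whose general category is a mark (M*)."""
    return [unicodedata.name(ch) for ch in text
            if unicodedata.category(ch).startswith("M")]

print(marks(DEVANAGARI))
print(marks(ARABIC))
```

The Devanagari string carries one explicit vowel sign; the Arabic-script string is three consonant letters with no vowel marks at all.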

Manishearth · on Oct 1, 2019

Eh, you really need to understand how scripts work without necessarily being able to read them.

You may need to be able to read some scripts, but learning scripts is much easier than learning languages. Most text rendering experts I know seem to be bilingual or monolingual, but understand the mechanics of a lot more scripts (and can read a couple). Many of them are people who taught themselves about other scripts as they went along.

It's quite easy to talk about text from other scripts in a more clinical way without actually being able to read the script. I've often had text rendering discussions about the Perso-Arabic or Devanagari scripts with folks who can't 100% read the script but know its mechanics. You can totally describe things in terms of general categories like consonants and vowels (in both scripts they behave differently; an equivalent in the Latin script would be talking about letters and accent marks).
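The Latin equivalent is easy to poke at in code. A minimal Python sketch, assuming only the standard `unicodedata` module: decompose a precomposed letter with NFD and look at the general category of each piece. `split_marks` is an illustrative name.

```python
import unicodedata

def split_marks(text):
    """Decompose text (NFD) and pair each code point with its general category."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(ch, unicodedata.category(ch)) for ch in decomposed]

# U+00E9 is the precomposed "e with acute".
print(split_marks("\u00e9"))
```

The base letter comes back as a lowercase letter (`Ll`) and the accent as a nonspacing mark (`Mn`), which is the "letters and accent marks" split in Unicode's own terms.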

I once wrote https://manishearth.github.io/blog/2017/01/15/breaking-our-l... which goes through the various ways scripts deviate from Latin that most programmers should know. There's a lot that isn't listed there (which only folks working specifically on text would need to care about), but it's not hard to acquire that background to a level good enough to be effective.

As demonstrated in that post you can also "collapse" a lot of scripts together into one set of scripts with similar behavior. A lot of the weirdness in text shaping, for example, is covered by the Perso-Arabic script and any one Indic script. I like to say that there's a reason so many people involved in text shaping are Persians.

Personally, while this stuff isn't my day job, I can read around ten scripts to varying degrees of success, but I know like ... a couple of words from each language whose script I can read. It's not hard to learn to read a script, and as I mentioned you don't even need to be able to read them: if we're counting understanding the mechanics of scripts, my 10 balloons to a number I can't even count, because I can, for example, now include most Indic scripts. I've had productive conversations in Unicode spaces about e.g. the Punjabi script without being able to properly read it.