Lexing is such a small part of compilation overhead that not being optimal isn't going to kill you. In my case, writing the lexer by hand is necessary because I have to memoize token identities (and all the parsing/typing info attached).
Yes, I write my own hash tables also for the same reason (so they support incremental computations).
> Lexing is such a small part of compilation overhead that not being optimal isn't going to kill you.
You still then need to write the extra boilerplate per element. If you use a tool, you are literally just looking at "keyword <space> code to run when that keyword is matched", without any surrounding "how to check if it is that keyword". This is easier to write and easier to maintain, in addition to the advantages I discussed earlier.
> In my case, writing the lexer by hand is necessary because I have to memoize token identities (and all the parsing/typing info attached).
If I understand what you mean, then this is trivially done with most existing tools as part of your token rule (return an interned string, which you can use an existing data structure for). If not, then you would first write a tool. Again: it may feel really "hard core" to write a lexer, but it is repetitive code that a good design factors out into a lexer generator.
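To illustrate the token-rule idea (a minimal sketch, not any particular tool's API; all names here are hypothetical), the rule body just interns the lexeme so that repeated occurrences share one object identity, and memoized parsing/typing info can hang off that object:

```python
# Hypothetical sketch: a token rule that interns its lexeme, so the same
# (kind, text) pair always yields the same Token object. Memoized info
# attached to that object survives across lexer runs.

class Token:
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text
        self.memo = {}  # parsing/typing info cached on the token identity

_intern_table = {}

def intern_token(kind, text):
    key = (kind, text)
    tok = _intern_table.get(key)
    if tok is None:
        tok = Token(kind, text)
        _intern_table[key] = tok
    return tok

# In a generated lexer, the per-rule code would reduce to one line:
def on_identifier(text):
    return intern_token("IDENT", text)
```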
> Yes, I write my own hash tables also for the same reason (so they support incremental computations).
Careful: I imagine you mean to say you write your own hash table library/compiler, which is different from laying out the hashtable by hand. I have also written my own hashtable for many reasons, but I would never sit down with an array literal and manually put the entries in there by hand: even if I don't make any mistakes, it is a pointless waste of my time that is trivially automatable using a computer.
Keywords are quite easy: just look each identifier up in a hash table once its boundaries are detected. Again, it is more expensive than having a generated FSM recognize keywords directly, but the expense is really in the noise when everything else is considered.
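Concretely, the lookup amounts to one probe per identifier (a sketch; the keyword set shown is an illustrative subset, not any real language's):

```python
# Sketch: classify keywords with a hash-table lookup after the lexer has
# found the identifier's boundaries, instead of baking each keyword into
# the lexing FSM.

KEYWORDS = {"if", "else", "while", "return"}  # illustrative subset

def classify_identifier(text):
    # One set probe per identifier; slower than an FSM that recognizes
    # keywords directly, but negligible against total compile time.
    return ("KEYWORD", text) if text in KEYWORDS else ("IDENT", text)
```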
As for incremental lexing, you need to tell whether your tokens pre-existed as the same kind (not necessarily the same string!) before the edit or not. It would be trivial to add this to a generator, but how would it then feed back the signals needed to take advantage of the memoization (e.g. by providing a persistent token ID that can unlock pre-existing information about the token)? There are simply no standards for that.
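The hand-off being described might look like this (a simplified sketch under the assumption that tokens are matched up by position; real incremental lexers diff by byte ranges, and every name here is illustrative):

```python
# Sketch: after an edit, a re-lexed token that has the same kind as the
# token previously at that position keeps its persistent ID, unlocking
# whatever info was memoized under that ID. Same-kind, different-text
# tokens still reuse the identity, as described above.

def relex_with_reuse(old_tokens, new_tokens):
    """old_tokens: list of (persistent_id, kind, text) from before the edit.
    new_tokens: list of (kind, text) from re-lexing after the edit.
    Returns new (persistent_id, kind, text) triples."""
    next_id = max((tid for tid, _, _ in old_tokens), default=-1) + 1
    result = []
    for i, (kind, text) in enumerate(new_tokens):
        if i < len(old_tokens) and old_tokens[i][1] == kind:
            tid = old_tokens[i][0]  # same kind as before: reuse identity
        else:
            tid = next_id           # new or changed token: fresh identity
            next_id += 1
        result.append((tid, kind, text))
    return result
```

A generator could emit the lexing part of this easily; it is the ID plumbing back to the caller that no tool standardizes.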
In most of the professionally written compilers (e.g. scalac) I've worked on, lexer and parser generators aren't even used, and it really isn't that big of a deal to write these by hand; you also get the benefit that the whole compiler is written in the same language. This becomes especially true when IDE services are considered, whereas most generators are pretty much limited to batch applications.
> I have also written my own hashtable for many reasons, but I would never sit down with an array literal and manually put the entries in there by hand: even if I don't make any mistakes, it is a pointless waste of my time that is trivially automatable using a computer.
I see your point, but it really depends on the key space you are optimizing for. You might just put the elements in by hand if a generic algorithm isn't really called for.
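For a small, fixed key space, "by hand" can be quite literal (a toy sketch; the slot function and layout below were worked out by hand for exactly these four keys and are purely illustrative):

```python
# Sketch: a tiny hand-laid-out keyword table. For these four keys,
# (first byte + length) mod 8 happens to be collision-free, so the
# entries can be placed in an array literal by hand; no generic hash
# table or generator is called for.

_TABLE = ["return", "else", None, "if", "while", None, None, None]

def _slot(word):
    return (ord(word[0]) + len(word)) % 8  # chosen by hand for these keys

def is_keyword(word):
    return bool(word) and _TABLE[_slot(word)] == word
```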