Regular Expression Performance
In this very well-written article, Russ Cox explains why most current regular expression implementations take exponential time on certain patterns, while Ken Thompson’s NFA implementation from the 1960s runs in linear time on all inputs. Cox suggests that the NFA approach is compatible with modern regex features and that this is how regular expressions should be implemented. My hunch is that the current recursive backtracking implementations would be faster and use less memory on most real-world, non-pathological inputs, and that they have better engineering characteristics, such as being easier to understand, debug, extend, and modify. Perhaps there are clever tricks that can be employed with the NFA implementation, but I suspect that this is an area where the theory is very nice but doesn’t lend itself to an efficient and modular implementation. Thus, I favor a hybrid approach where simple patterns that could trigger pathological backtracking are dispatched to a Thompson NFA and the rest are handled in the normal way. (Via John Gruber.)
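
To make the blowup concrete, here is a minimal sketch (my own code, but using the pathological pattern from Cox’s article) that times Python’s backtracking `re` module matching a?ⁿaⁿ against the string aⁿ. The match always succeeds, yet the running time grows roughly exponentially in n:

```python
import re
import time

def time_match(n):
    # Pathological pattern from Cox's article: n optional a's followed
    # by n required a's, e.g. n=3 gives "a?a?a?aaa".
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    assert re.fullmatch(pattern, text)  # succeeds, but backtracks heavily
    return time.perf_counter() - start

for n in (5, 10, 15, 20, 24):
    print(f"n={n:2d}: {time_match(n):.4f}s")
```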
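
By contrast, Thompson’s approach advances a *set* of NFA states through the input in lockstep, so each input character costs at most O(m) work for a pattern of length m. Here is a toy simulation (mine, not Thompson’s full construction; it handles only literal characters and `?`) that matches the same pattern class without any backtracking:

```python
def nfa_match(pattern, text):
    # Tokenize into (char, optional) pairs: "a?" -> ('a', True).
    tokens, i = [], 0
    while i < len(pattern):
        opt = i + 1 < len(pattern) and pattern[i + 1] == "?"
        tokens.append((pattern[i], opt))
        i += 2 if opt else 1

    def closure(states):
        # Epsilon closure: an optional token may be skipped entirely.
        out, frontier = set(states), list(states)
        while frontier:
            s = frontier.pop()
            if s < len(tokens) and tokens[s][1] and s + 1 not in out:
                out.add(s + 1)
                frontier.append(s + 1)
        return out

    # A state is a position in the token list; position len(tokens) accepts.
    states = closure({0})
    for ch in text:
        states = closure({s + 1 for s in states
                          if s < len(tokens) and tokens[s][0] == ch})
    return len(tokens) in states

n = 30
print(nfa_match("a?" * n + "a" * n, "a" * n))  # True, fast even for large n
```

Even this toy version runs in roughly n² time here, versus the exponential blowup above; Cox’s article spells out how the idea extends to full regular-expression syntax.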