Wednesday, January 29, 2014

Parsing HTML With Regular Expressions


bobince:

You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML.

Tom Christiansen:

It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.

But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don’t you believe them.

So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn’t mean it should be.
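To make the "do so poorly" case concrete, here is a minimal Python sketch. It is my own illustration, not code from either answer. The sample markup and the names `SAMPLE`, `outer_div_regex`, `DivText`, and `outer_div_text` are made up for the example. The naive pattern stops at the first closing tag it sees, so it truncates nested content, while the standard library's `html.parser` tracks nesting depth for you.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <div> nested inside another.
SAMPLE = '<div class="outer"><div class="inner">hi</div> tail</div>'


def outer_div_regex(html):
    """Naive attempt: grab whatever sits between <div ...> and the next </div>."""
    match = re.search(r"<div[^>]*>(.*?)</div>", html, re.S)
    return match.group(1) if match else None


class DivText(HTMLParser):
    """Collect the text found inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text while we are inside at least one <div>.
        if self.depth > 0:
            self.chunks.append(data)


def outer_div_text(html):
    parser = DivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(repr(outer_div_regex(SAMPLE)))  # stops at the inner </div>
    print(repr(outer_div_text(SAMPLE)))   # sees both levels of nesting
```

This only shows that the easy pattern is wrong. It says nothing about whether a far more careful set of patterns could get it right, which is Christiansen's point.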


