{"id":8394,"date":"2014-01-29T12:07:33","date_gmt":"2014-01-29T17:07:33","guid":{"rendered":"http:\/\/mjtsai.com\/blog\/?p=8394"},"modified":"2014-01-29T12:07:34","modified_gmt":"2014-01-29T17:07:34","slug":"parsing-html-with-regular-expressions","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2014\/01\/29\/parsing-html-with-regular-expressions\/","title":{"rendered":"Parsing HTML With Regular Expressions"},"content":{"rendered":"<p><a href=\"http:\/\/stackoverflow.com\/questions\/1732348\/regex-match-open-tags-except-xhtml-self-contained-tags\/1732454#1732454\">bobince<\/a>:<\/p>\n<blockquote cite=\"http:\/\/stackoverflow.com\/questions\/1732348\/regex-match-open-tags-except-xhtml-self-contained-tags\/1732454#1732454\"><p>You can&rsquo;t parse [X]HTML with regex. Because HTML can&rsquo;t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML.<\/p><\/blockquote>\n<p><a href=\"http:\/\/stackoverflow.com\/questions\/4231382\/regular-expression-pattern-not-matching-anywhere-in-string\/4234491#4234491\">Tom Christiansen<\/a>:<\/p>\n<blockquote cite=\"http:\/\/stackoverflow.com\/questions\/4231382\/regular-expression-pattern-not-matching-anywhere-in-string\/4234491#4234491\"><p>It <em>is<\/em> true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.<\/p>\n\n<p>But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don&rsquo;t you believe them.  <\/p>\n\n<p>So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn&rsquo;t mean it&nbsp;<strong><em>should<\/em><\/strong>&nbsp;be. <\/p>\n<\/blockquote>","protected":false},"excerpt":{"rendered":"<p>bobince: You can&rsquo;t parse [X]HTML with regex. Because HTML can&rsquo;t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"","apple_news_api_id":"","apple_news_api_modified_at":"","apple_news_api_revision":"","apple_news_api_share_url":"","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[4],"tags":[339,270,252,71,234],"class_list":["post-8394","post","type-post","status-publish","format-standard","hentry","category-programming-category","tag-html","tag-parser","tag-perl","tag-programming","tag-regex"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/8394","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=8394"}],"version-history":[{"count":0,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/8394\/revisions"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=8394"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=8394"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=8394"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}