I have a regular expression that converts HTML to plain text, but it uses a lot of CPU. How can I optimize it?

((\n|\r){2,}) | (\r|\n)|<head.*?</head>|<script.*?</script> |<meta[^>]+>|<style.*?</style> | <[^>]*> |&[^\s]*;
Vivek Goel
  • Use a parser instead of a regex. – Matt Ball Jun 01 '11 at 14:08
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags :) – MByD Jun 01 '11 at 14:10
  • Wouldn't a parser use even more CPU? – Vivek Goel Jun 01 '11 at 14:10
  • You really think CPU usage will be high with an HTML parser, @Vivek? Do you notice high CPU usage from your Web browser while *it's* parsing the pages you download? – Rob Kennedy Jun 01 '11 at 14:20
  • @Rob Kennedy, I am writing a C++ application and I don't care about the DOM, so I think a parser will use more CPU. I only need to strip all HTML tags and extract the text. – Vivek Goel Jun 01 '11 at 14:29
  • @Vivek, The thing to do, when there's a question of speed, is to _measure_. Feed your page to a parser and see how long it takes compared to your regular expression. One quick experiment would settle the matter. – Wayne Conrad Jun 01 '11 at 15:09
  • @Vivek, Web browsers don't use the DOM to process the text they download. They *generate* the DOM by *parsing* the HTML text. And if you wanted to extract text from HTML, then you should have just asked [how to extract text from HTML](http://stackoverflow.com/questions/3605592/how-can-i-extract-text-from-html-using-c). – Rob Kennedy Jun 01 '11 at 15:14
  • @Rob Kennedy. I will try with libxml. – Vivek Goel Jun 01 '11 at 17:00
  • @Rob Kennedy, I tried libxml. It takes 10 ms to convert HTML to text, while the PCRE regular expression takes only 1 sec. – Vivek Goel Jun 02 '11 at 08:00
  • That's terrific, @Vivek! You can now process 100 files in the time it used to take you to process just one. – Rob Kennedy Jun 02 '11 at 13:30
  • @Rob Kennedy, sorry, I gave you the wrong time for PCRE. It was 1 ms, not 1 sec. :( – Vivek Goel Jun 02 '11 at 16:22
  • @Goel, yeah but your regex is broken and will not work properly for a lot of valid html. – Qtax Jun 09 '11 at 08:09
  • @Qtax, that is acceptable if it works in 80% of cases. Do you have any suggestion for parsing HTML? I was not able to parse malformed HTML using libxml. – Vivek Goel Jun 09 '11 at 08:57
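
The comments above recommend measuring instead of guessing. A minimal C++ timing sketch might look like the following. `time_conversion` and `TimingResult` are hypothetical names, not part of libxml or PCRE, and the converter is passed in as a callback so the same harness can wrap either approach.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical result type: average cost per call plus a checksum-like total.
struct TimingResult {
    double micros_per_call;          // average wall-clock time of one conversion
    std::size_t total_output_bytes;  // summed output sizes, so results are used
};

// Runs `convert(input)` `runs` times and reports the average time per call.
// `convert` is whatever HTML-to-text routine is being compared.
TimingResult time_conversion(
    const std::function<std::string(const std::string&)>& convert,
    const std::string& input, std::size_t runs) {
    assert(runs > 0 && "need at least one run");
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals

    std::size_t total = 0;
    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        total += convert(input).size();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return TimingResult{elapsed.count() / static_cast<double>(runs), total};
}
```

Running it with many iterations on a representative page, once per approach, gives numbers that can be compared directly; a single run is easy to misread, as the 1 sec versus 1 ms mix-up above shows.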

1 Answer

Use an HTML parser if you can. Regex is bad for HTML for various reasons, and performance will inevitably suffer as well.
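
For the narrower goal stated in the question (strip tags, keep the text), one middle ground is a hand-written single-pass scanner, which cannot backtrack the way the alternation-heavy regex can. The sketch below is not an HTML parser and will mishandle things a real parser gets right (comments, `>` inside attribute values, CDATA). `strip_html` and its helpers are hypothetical names, and the handling is an assumption modeled on the regex in the question: `script`, `style` and `head` elements are dropped with their content, other tags are dropped, entities are removed rather than decoded, and runs of line breaks collapse to one `\n`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

namespace {

// Elements whose whole content is skipped (assumption taken from the regex).
struct Skipped { std::string open, close; };
const Skipped kSkipped[] = {
    {"<script", "</script"}, {"<style", "</style"}, {"<head", "</head"}};
const std::size_t kMaxEntityLength = 32;  // arbitrary cap on "&name;" length

// True if `lower` (lowercase ASCII) occurs at `pos` ignoring case and is not
// followed by a letter or digit, so "<head" does not match "<header".
bool tag_name_at(const std::string& text, std::size_t pos, const std::string& lower) {
    if (pos > text.size() || text.size() - pos < lower.size()) return false;
    for (std::size_t i = 0; i < lower.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(text[pos + i])) != lower[i]) return false;
    }
    const std::size_t after = pos + lower.size();
    return after == text.size() || !std::isalnum(static_cast<unsigned char>(text[after]));
}

// Index just past the '>' of the first `closing` tag at or after `pos`.
// An element that is never closed swallows the rest of the input.
std::size_t skip_past(const std::string& text, std::size_t pos, const std::string& closing) {
    for (; pos < text.size(); ++pos) {
        if (text[pos] == '<' && tag_name_at(text, pos, closing)) {
            const std::size_t gt = text.find('>', pos);
            return gt == std::string::npos ? text.size() : gt + 1;
        }
    }
    return text.size();
}

}  // namespace

std::string strip_html(const std::string& html) {
    std::string out;
    out.reserve(html.size());
    std::size_t i = 0;
    while (i < html.size()) {
        const char c = html[i];
        if (c == '<') {
            std::size_t next = std::string::npos;
            for (const Skipped& s : kSkipped) {
                if (tag_name_at(html, i, s.open)) { next = skip_past(html, i, s.close); break; }
            }
            if (next == std::string::npos) {
                const std::size_t gt = html.find('>', i);
                if (gt == std::string::npos) { out.push_back(c); ++i; continue; }  // stray '<'
                next = gt + 1;  // ordinary tag: drop everything through '>'
            }
            i = next;
        } else if (c == '&') {
            std::size_t j = i + 1;
            while (j < html.size() && j - i <= kMaxEntityLength &&
                   (std::isalnum(static_cast<unsigned char>(html[j])) || html[j] == '#')) ++j;
            if (j > i + 1 && j < html.size() && html[j] == ';') i = j + 1;  // drop entity
            else { out.push_back(c); ++i; }                                 // literal '&'
        } else if (c == '\r' || c == '\n') {
            while (i < html.size() && (html[i] == '\r' || html[i] == '\n')) ++i;
            out.push_back('\n');  // collapse a run of line breaks
        } else {
            out.push_back(c);
            ++i;
        }
    }
    assert(out.size() <= html.size());  // stripping only ever removes bytes
    return out;
}
```

For example, `strip_html("<p>Hello</p>")` yields `Hello`. Whether this is actually cheaper than the regex or a parser on real pages is something to measure rather than assume.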

Francis Gilbert
  • Error recovery is challenging in parsers. The only time regexes work well for HTML is when it is **not** general open‐ended HTML, but rather well‐defined HTML snippets. Even then, use [grammatical regexes](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579), not compact ones. The **very best regex approaches** themselves strongly resemble actual lexers, such as in [this answer](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). I’d stick to existing parsing modules if possible. – tchrist Jun 09 '11 at 17:02
  • @Francis Gilbert, I am not able to find a parser that can handle malformed HTML. I saw one discussion on SO that says to pass the HTML through tidy first and then through libxml. That would be more resource-hungry for me. I don't care about valid HTML; I just need to extract the text. – Vivek Goel Jun 10 '11 at 07:49