Optimizing Regular Expression

Question

I have a regular expression that converts HTML to text, but it uses a lot of CPU. How can I optimize it?

((\n|\r){2,}) | (\r|\n)|<head.*?</head>|<script.*?</script> |<meta[^>]+>|<style.*?</style> | <[^>]*> |&[^\s]*;

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags :) — MByD, Jun 01 '11 at 14:10
You really think CPU usage will be high with an HTML parser, @Vivek? Do you notice high CPU usage from your Web browser while *it's* parsing the pages you download? — Rob Kennedy, Jun 01 '11 at 14:20
@Rob Kennedy, I am writing a C++ application. I don't care about the DOM, so I think using a parser will cost more CPU. I only need to strip all HTML tags and extract the text. – Vivek Goel, Jun 01 '11 at 14:29
@Vivek, The thing to do, when there's a question of speed, is to _measure_. Feed your page to a parser and see how long it takes compared to your regular expression. One quick experiment would settle the matter. — Wayne Conrad, Jun 01 '11 at 15:09
@Vivek, Web browsers don't use the DOM to process the text they download. They *generate* the DOM by *parsing* the HTML text. And if you wanted to extract text from HTML, then you should have just asked [how to extract text from HTML](http://stackoverflow.com/questions/3605592/how-can-i-extract-text-from-html-using-c). — Rob Kennedy, Jun 01 '11 at 15:14
@Rob Kennedy, I tried libxml. It takes 10 ms to convert the HTML to text, while the PCRE regular expression takes only 1 sec. – Vivek Goel, Jun 02 '11 at 08:00
That's terrific, @Vivek! You can now process 100 files in the time it used to take you to process just one. — Rob Kennedy, Jun 02 '11 at 13:30
@Rob Kennedy, sorry, I gave you the wrong time for PCRE. It was 1 ms, not 1 sec. :( – Vivek Goel, Jun 02 '11 at 16:22
@Goel, yeah but your regex is broken and will not work properly for a lot of valid html. — Qtax, Jun 09 '11 at 08:09
@Qtax, that is an acceptable case if it works for 80% of inputs. Do you have any suggestions for parsing HTML? I was not able to parse malformed HTML using libxml. – Vivek Goel, Jun 09 '11 at 08:57
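
The "measure it" advice in the comments above can be sketched as a small timing helper. This is only a minimal sketch: `average_microseconds` is a hypothetical name, and the converter is assumed to be any callable that takes a `const std::string&` and returns a `std::string`, for example a wrapper around the PCRE pattern or around a libxml-based extractor.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: average wall-clock time of one call to `convert`,
// in microseconds, measured over `runs` calls on the same input.
template <typename Converter>
double average_microseconds(Converter convert, const std::string& html,
                            std::size_t runs) {
    if (runs == 0) {
        return 0.0;  // nothing to measure
    }
    using clock = std::chrono::steady_clock;
    // Written on every iteration so the compiler cannot discard the calls.
    volatile std::size_t sink = 0;
    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        sink = sink + convert(html).size();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

Calling it once with the regex-based converter and once with the parser-based one, on the same real pages and with a few hundred runs each, gives numbers that can be compared directly. A single run on a single page is easy to misread.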

score 1 · Answer 1 · answered Jun 09 '11 at 16:15


Use an HTML parser if you can. Regex is bad for HTML for various reasons, and performance will inevitably suffer as well.


Francis Gilbert


Error recovery is challenging in parsers. The only time regexes work well for HTML is when it is **not** general open‐ended HTML, but rather well‐defined HTML snippets. Even then, use [grammatical regexes](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579), not compact ones. The **very best regex approaches** themselves strongly resemble actual lexers, such as in [this answer](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). I’d stick to existing parsing modules if possible. – tchrist Jun 09 '11 at 17:02
@Francis Gilbert, I am not able to find a parser that can handle malformed HTML. I saw one discussion on SL, but it says to pass the HTML through tidy first and then to libxml. That would be more resource-hungry for me. I don't care about valid HTML; I just need to extract the text part. – Vivek Goel Jun 10 '11 at 07:49
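
For the narrow goal described here (drop the tags, keep the text, tolerate malformed input), a lexer-style single pass in the spirit of tchrist's comment is one option that avoids regex backtracking. The following is only a sketch under stated assumptions, not a replacement for a real parser: `strip_tags` and its two helpers are hypothetical names, the contents of `script`, `style`, and `head` are skipped as in the original pattern, and entities such as `&amp;`, comments that contain `>`, and a literal `<` in running text are deliberately not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cased run of alphanumeric characters starting at `pos`.
static std::string tag_name_at(const std::string& html, std::size_t pos) {
    std::string name;
    while (pos < html.size() &&
           std::isalnum(static_cast<unsigned char>(html[pos]))) {
        name += static_cast<char>(
            std::tolower(static_cast<unsigned char>(html[pos])));
        ++pos;
    }
    return name;
}

// Position of the first "</name" at or after `from` (case-insensitive),
// or std::string::npos if there is none.
static std::size_t find_close_tag(const std::string& html,
                                  const std::string& name, std::size_t from) {
    for (std::size_t i = from; i + 2 + name.size() <= html.size(); ++i) {
        if (html[i] == '<' && html[i + 1] == '/' &&
            tag_name_at(html, i + 2) == name) {
            return i;
        }
    }
    return std::string::npos;
}

// Hypothetical single-pass stripper: copies text outside tags and skips
// everything between '<' and the next '>'.
std::string strip_tags(const std::string& html) {
    std::string text;
    text.reserve(html.size());
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {
            text += html[i++];
            continue;
        }
        const std::string name = tag_name_at(html, i + 1);
        std::size_t end = html.find('>', i);
        if (end == std::string::npos) break;  // unterminated tag: drop the rest
        i = end + 1;
        if (name == "script" || name == "style" || name == "head") {
            // Skip the element body up to and including its closing tag.
            const std::size_t close = find_close_tag(html, name, i);
            if (close == std::string::npos) break;
            end = html.find('>', close);
            if (end == std::string::npos) break;
            i = end + 1;
        }
    }
    return text;
}
```

Because the scan never backtracks and stops as soon as a tag is left unterminated, malformed markup cannot make it blow up. Whether it is actually faster than the PCRE pattern on real pages is something to measure rather than assume.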
