Extracting HTML tags with C++

Question

I'm working on a crawler written in C++ for a search engine. The crawler receives a list of HTML files and needs to extract the HTML tags from them and write them to a file.

I have heard about using an XML parser, but I can't figure out how to convert an HTML file to XHTML. Besides, converting to XHTML is expensive in terms of performance, and HTML parsers for C++ are almost non-existent.

A third option is to use Boost.Regex to extract the tags from the HTML files, but I need to extract all of them (p, h1, h2, a, ...), so writing a pattern for each tag would take too long.
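
For what it's worth, the regex route does not need one pattern per tag: a single pattern can capture the name of every opening tag. Below is a minimal sketch of that idea, assuming `std::regex` from the standard library in place of Boost.Regex (the pattern syntax is the same for this case). The function name `extract_tag_names` is hypothetical. It is a rough approximation rather than a real HTML parser: it can be fooled by a `>` inside an attribute value, by tags that appear inside comments or scripts, and it truncates names that contain hyphens.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the lower-cased names of all opening tags,
// in the order they appear. Closing tags (</p>) and comments are skipped
// because the character after '<' must be a letter.
std::vector<std::string> extract_tag_names(const std::string& html) {
    static const std::regex open_tag(R"(<([A-Za-z][A-Za-z0-9]*)\b[^>]*>)");
    std::vector<std::string> names;
    for (std::sregex_iterator it(html.begin(), html.end(), open_tag), end;
         it != end; ++it) {
        std::string name = (*it)[1].str();  // capture group 1 is the tag name
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        names.push_back(name);
    }
    return names;
}
```

Calling `extract_tag_names` on the contents of each crawled file and writing the returned names to an output stream would cover the "put them into a file" step.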

Is there any other way to get the HTML tags in C++?

This seems to be a dupe of [Jsoup like html parser for C++](http://stackoverflow.com/questions/17921697/jsoup-like-html-parser-for-c) which, by the way, was the first google result for "c++ html parsing". And the answer is: you want [`QWebElement`](http://qt-project.org/doc/qt-5/qwebelement.html). — Massa, Apr 23 '14 at 18:37
Qt is free, multiplatform software, so, yes (I use mostly Linux myself)... — Massa, Apr 23 '14 at 20:39

score -1 · Accepted Answer · answered Apr 23 '14 at 18:42

Try parsing it with an XML parser. I usually use RapidXML; check it here.

Provided the HTML is well-formed XML, you will get all the tags and attributes of the HTML file.

Samer

Can you explain a bit more, e.g. how I can get the XML file from the HTML? Thanks. – Reda Apr 23 '14 at 19:08
Check out [http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1parsing]; it shows how to get an `xml_document<> doc;`, which includes all tags and attributes. – Samer Apr 23 '14 at 20:05
Check out this question also: – Samer Apr 23 '14 at 20:11

score -1 · Answer 2 · answered Oct 08 '15 at 12:12

You could use the HTML parser from libxml.

el.pescado
