2

I'm currently working on a crawler written in C++ for a search engine. The crawler will get a list of HTML files and needs to extract the HTML tags and write them to a file.

I heard about using an XML parser, but I can't figure out how to convert the HTML files to XHTML. Besides, converting to XHTML is expensive in terms of performance, and HTML parsers for C++ are almost non-existent.

The third way is using Boost.Regex to extract these tags from the HTML files, but I need to extract all the tags (p, h1, h2, a, ...), so it would take rather long to write.

Are there any other ways to get the HTML tags in C++?

halfer
Reda
  • This seems to be a dupe of [Jsoup like html parser for C++](http://stackoverflow.com/questions/17921697/jsoup-like-html-parser-for-c) which, by the way, was the first Google result for "c++ html parsing". And the answer is: you want [`QWebElement`](http://qt-project.org/doc/qt-5/qwebelement.html). – Massa Apr 23 '14 at 18:37
  • I'm coding on Linux. Can I use Qt there? – Reda Apr 23 '14 at 19:12
  • Qt is free, multiplatform software, so, yes (I use mostly Linux myself)... – Massa Apr 23 '14 at 20:39

2 Answers

-1

Try parsing it with an XML parser. I usually use RapidXML.

You will get all tags and attributes of the HTML file.

Samer
  • Can you explain a bit more, for example how I can get the XML file from the HTML? Thanks. – Reda Apr 23 '14 at 19:08
  • Check out [http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1parsing]; it shows how you can get `xml_document<> doc;`, which includes all tags and attributes. – Samer Apr 23 '14 at 20:05
  • Check out this question also: – Samer Apr 23 '14 at 20:11
-1

You could use the HTML parser from libxml2.

el.pescado