I'm currently working on a crawler written in C++ for a search engine. The crawler gets a list of HTML files and needs to extract the HTML tags and write them to a file.
I have heard about using an XML parser, but I can't figure out how to convert the HTML files to XHTML. On top of that, converting to XHTML is expensive in terms of performance, and HTML parsers for C++ seem to be almost non-existent.
A third option is to use Boost.Regex to extract the tags from the HTML files, but I need to extract all the tags (p, h1, h2, a, ...), so that would be rather tedious to write.
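For reference, here is a minimal sketch of what the regex route might look like with a single generic pattern instead of one pattern per tag. It uses `std::regex` from the standard library rather than Boost (an assumption made to keep the example self-contained), and `extract_tag_names` is a hypothetical helper name. It only collects the names of opening tags, and it is not a real HTML parser: attribute values containing `>`, tag names with hyphens, and script or comment content can all confuse it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the names of all opening tags in document order.
// Closing tags, comments and the doctype are skipped because the character
// after '<' must be a letter.
std::vector<std::string> extract_tag_names(const std::string& html) {
    // '<', a tag name (captured), optional attributes, then '>'.
    static const std::regex tag_re(R"(<([A-Za-z][A-Za-z0-9]*)\b[^>]*>)");

    std::vector<std::string> names;
    for (std::sregex_iterator it(html.begin(), html.end(), tag_re), end;
         it != end; ++it) {
        names.push_back((*it)[1].str());  // capture group 1 is the tag name
    }
    return names;
}
```

Calling `extract_tag_names` on the contents of one HTML file gives a vector of names that could then be written to the output file, one per line.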
Are there any other ways to extract HTML tags in C++?