I have some dirty HTML that is loaded from a third-party server (so I can't produce a JSON file or clean up the HTML at the source). The HTML is structured like this:


<div class="pic"> ... </div>

<div class="pic" id="pic311809">

<input type="hidden" class="pic_id" name="pic_id" value="311809" />

<!-- tylko komixxy.pl -->
<div style="font-family: verdana, arial, helvetica, sans-serif; font-weight: bold; font-size: 9px;">
                                        <a href="pic/show_series/1">FFFUUU (rageman)</a>

<h1 class="picture">Kochana babcia</h1>

<div class="infobar">
    Wrzucone 15 października 2010 o 16:03       przez <a href="/user/Astraly">Astraly</a>
    <a href="http://komixxy.pl/311809/Kochana-babcia#comments">Skomentuj (23)</a>
    <!-- głosowanie przeniesione pod spód obrazka -->
</div><!-- .infobar -->

<div class="pic_image">
                <a href="http://komixxy.pl/311809/Kochana-babcia"><img src="http://staticrps.komixxy.pl/uimages/201010/1287151388_by_Astraly_500.jpg" class="pic" alt="Kochana babcia - Wnusiu, a ty jeszcze nie w szkole? Dziś mamy na 10 babciu Co ty tam majaczysz? Jesteś na wagarach!? już ja to powiem twojej mamie! Ale babciu.... Przynosisz nam wstyd! Myślisz, że nie wiem o tej ostatniej niedzieli, w której nie byłeś u komunii? ZAMKNIJ SIĘ KU**A!!!! .... Nie musisz tak krzyczeć! Powiem twojej mamie z jakim tonem odnosisz się do mnie! " /></a>          </div><!-- .pic_image -->

                <div class="source">Źródło: Kto mieszka z babcią, ten wie jak to jest ;)</div>

<!-- głosowanie i ocena -->

<div class="source">

    <div class="infobar center">


        <a href="/pic/vote/311809/up"
             onclick="votowanie(this); return false;"
             class="vote voteup iconlink"
            mocne ↑         </a>


        <a href="/pic/vote/311809/down"
             onclick="votowanie(this); return false;"
             class="vote votedown iconlink"
            słabe ↓         </a>



        <span class="points">
                                87% mocnych

        <span class="count">
                                z 1291 głosów

        <span class="vote_result"></span>

                    | <a href="/user/add_favorite/311809" class="favorite">Do ulubionych</a>

    </div><!-- .infobar -->

    <div style="text-align: center;">
        <fb:like href="http://komixxy.pl/311809/Kochana-babcia"
                         style="width: 130px;">

    <!-- tylko komixxy.pl -->
    <a href="http://komixxy.pl/pic/show_group/311809" class="picbutton">Pokaż podobne komixxy</a>       <a href="http://komixxy.pl/przerob/311809" class="picbutton">Zrób własną wersję</a>
    <div style="clear: both;"></div>

</div><!-- .source -->

</div><!-- .pic -->

<div class="pic"> ... </div>

<div class="pic"> ... </div>

<div class="pic"> ... </div>

I want to select every <div class="pic" id="*"> using the XPath expression //div[@class='pic'][@id].

Here are two libraries that I used:

- Hpple
- TouchXML

Hpple is great, but I can't get the innerHTML of an element with it. I use TouchXML for parsing XML and it works well there, but it can't parse dirty HTML: I get dozens of errors.

Is there a way to parse this HTML on iOS 5 using TouchXML? Another library would be fine, but I'd prefer that one.

I heard something about CTidy.h and followed the instructions, but nothing changed...

  • I would say you are more or less there with a solution. Using a combination of both Hpple and TouchXML will get you the pics you need. – Nik Burns Nov 14 '11 at 16:33
  • But how? Hpple uses simple `libxml2` HTML parsing, so I can't select innerHTML of the element. If I could do this, I would have no problem with parsing... – akashivskyy Nov 14 '11 at 19:52
  • @Kashiv, please describe the detailed procedure for adding the TouchXML library to an iOS 5 project. How do you build TouchXML in non-ARC mode? – Tirth Dec 20 '11 at 10:02
  • @RRB There is an ARC version of TouchXML in their GitHub repository. ;) – akashivskyy Dec 21 '11 at 06:24
  • @Kashiv, I got it working yesterday, but thanks for helping me anyway. – Tirth Dec 21 '11 at 06:36

2 Answers


libxml has a module designed exactly for this problem :)


It works the same way as the regular libxml XML API. For example, to parse an NSData object containing dirty HTML:

#include <libxml/HTMLparser.h>

htmlDocPtr doc; /* the resulting document tree */
doc = htmlReadMemory([data bytes], [data length], "noname.xml", NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
if (NULL == doc)
    return nil;

... parse DOM here ...


Compare this with the XML example from the libxml website:

xmlDocPtr doc; /* the resulting document tree */
doc = xmlReadMemory(content, length, "noname.xml", NULL, 0);
if (NULL == doc)
    return nil;

... parse DOM here ...


PS: Don't forget to add libxml2.dylib to your project in the 'Link Binary With Libraries' build phase. You may also need to add /usr/include/libxml2 to the Header Search Paths build setting so that the #include resolves.


If I were doing this, I would pre-process the HTML before passing it to the libraries and clean out all of the 'dirty' bits: find the start and the end of each dirty region and remove everything in between, then do the same for the other dirty areas. The libraries will then have an easier time working with the file.
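
A minimal sketch of that pre-cleaning idea, in plain C so it can be called from Objective-C. The function name `strip_between` is hypothetical, and which markers delimit a "dirty" region is an assumption: the usage example strips HTML comments, but the markers would need to match whatever actually breaks the parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of `html` with every region
 * from `open` through the next `close` removed (markers included).
 * A region that is opened but never closed is dropped through the end of
 * the input. The caller frees the result. */
char *strip_between(const char *html, const char *open, const char *close) {
    assert(html != NULL && open != NULL && close != NULL);
    assert(*open != '\0' && *close != '\0');

    char *out = malloc(strlen(html) + 1);   /* output is never longer than input */
    if (out == NULL) return NULL;

    char *dst = out;
    const char *src = html;
    for (;;) {
        const char *start = strstr(src, open);
        if (start == NULL) {                /* no more regions: copy the tail */
            strcpy(dst, src);
            break;
        }
        memcpy(dst, src, (size_t)(start - src));
        dst += start - src;

        const char *end = strstr(start + strlen(open), close);
        if (end == NULL) {                  /* unterminated region: stop here */
            *dst = '\0';
            break;
        }
        src = end + strlen(close);
    }
    return out;
}
```

For example, `strip_between(html, "<!--", "-->")` would remove the comments from the page before it is handed to the parser. Plain substring matching like this knows nothing about HTML structure, so it is only reasonable for regions with unambiguous delimiters.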
