Parse broken HTML Sites with XPath

Question

I only find questions about Python here, and the tools I found are mostly for Python, so here is a new question: I need to query some things from an HTML page with XPath.

My current code looks like this:

URL url = new URL("http://somesite.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.connect();

Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                     .parse(new InputSource(connection.getInputStream()));

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//span[@class='a-class']");
String price = (String) expr.evaluate(doc, XPathConstants.STRING);

The problem is that either the page is broken or the parser has trouble reading it:

[Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)

Is there any tool that can read HTML sites better? Or should I just use a regex on the page?

score 2 · Accepted Answer · answered Apr 09 '13 at 07:43

Is there any tool that can read HTML sites better?

People speak highly of jsoup.
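
For what it's worth, here is a minimal sketch of how the query from the question might look with jsoup, assuming the jsoup jar is on the classpath. The class name `PriceScraper`, the helper `extractPrice`, and the sample markup are made up for illustration. The CSS selector `span[class=a-class]` is my stand-in for the XPath `//span[@class='a-class']`, not something taken from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PriceScraper {

    // Hypothetical helper: returns the text of the first <span class="a-class">,
    // or an empty string if no such element exists.
    static String extractPrice(String html) {
        // Jsoup.parse is lenient: a bare '&' or an unclosed tag does not throw.
        Document doc = Jsoup.parse(html);
        // CSS attribute selector standing in for //span[@class='a-class'].
        Element span = doc.select("span[class=a-class]").first();
        return span == null ? "" : span.text();
    }

    public static void main(String[] args) {
        // Sample markup (an assumption) with a bare '&' and unclosed tags,
        // the kind of input a strict XML parser rejects.
        String html = "<html><body><p>Fish & Chips<br>"
                + "<span class=\"a-class\">9.99</span></body></html>";
        System.out.println(extractPrice(html));
    }
}
```

For a live page, `Jsoup.connect("http://somesite.com").get()` should return a `Document` directly, so the manual `HttpURLConnection` handling is probably not needed. If the span can carry more than one class, `span.a-class` may be the better selector, since it matches on class membership rather than the exact attribute value.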

answered Apr 09 '13 at 07:43

T.J. Crowder

wow, jsoup works like a charm! – reox Apr 09 '13 at 08:31
@reox: Cool! Glad that helped. – T.J. Crowder Apr 09 '13 at 08:33
