The questions I find here are all about Python, and the tools I found are mostly for Python as well, so here is a new question: I need to query a few values from an HTML page with XPath in Java.

My current code looks like this:

// Fetch the page over HTTP.
URL url = new URL("http://somesite.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.connect();

// Parse the response with the JDK's XML parser (this is the step that fails).
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                     .parse(new InputSource(connection.getInputStream()));

// Evaluate the XPath expression and read the result as a string.
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//span[@class='a-class']");
String price = (String) expr.evaluate(doc, XPathConstants.STRING);

The problem is that either the page is malformed or the XML parser has trouble reading it:

[Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)

Is there any tool that can read html sites better? Or should I just run a regex over the page?

Neil
reox

1 Answer

Is there any tool that can read html sites better?

People speak highly of jsoup.
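
As a rough illustration of what that could look like, here is a minimal sketch. It assumes the jsoup library is on the classpath; the class name JsoupSketch and the helper firstPrice are made up for this example, and the CSS selector span.a-class is only an approximation of the XPath in the question (it matches any span whose class list contains a-class, not only an exact attribute match). jsoup is built to parse real-world HTML, so a bare '&' in the markup should not stop it the way it stops the XML parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Hypothetical helper: returns the text of the first <span class="a-class">,
    // or an empty string if the page contains no such element.
    static String firstPrice(String html) {
        // Jsoup.parse accepts messy, non-well-formed HTML.
        Document doc = Jsoup.parse(html);
        // CSS selector standing in for //span[@class='a-class'].
        Element span = doc.selectFirst("span.a-class");
        return span == null ? "" : span.text();
    }

    public static void main(String[] args) {
        // Sample markup with an unescaped '&' in a link, similar in spirit
        // to what made the XML parser fail in the question.
        String html = "<html><body>"
                + "<a href=\"/list?a=1&b=2\">more</a>"
                + "<span class=\"a-class\">19.99</span>"
                + "</body></html>";
        System.out.println(firstPrice(html));
    }
}
```

To work against the live page instead of a string, Jsoup.connect("http://somesite.com").get() returns the same kind of Document, so the selector part stays unchanged.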

T.J. Crowder