The questions I find here and the tools I have come across are mostly about Python, so here is a new question for Java: I need to query some values from an HTML page with XPath.
My current code looks like this:
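import java.net.*;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Fetch the page and parse it as XML (this is the step that fails)
URL url = new URL("http://somesite.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.connect();
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(connection.getInputStream()));
// Evaluate the XPath expression against the parsed document
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//span[@class='a-class']");
String price = (String) expr.evaluate(doc, XPathConstants.STRING);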
The problem is that the page seems to be broken, or at least the XML parser has trouble reading it:
[Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
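As far as I can tell, the message means the page contains a bare '&' that is not part of an entity such as `&amp;`. That is tolerated in HTML but is not well-formed XML. Below is a minimal sketch that I would expect to reproduce the failure without the network. The class name `BareAmpersandRepro`, the helper `parseError`, and the sample markup are made up for illustration; I am only assuming that the real page has a similar unescaped '&' at line 4, column 254.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class BareAmpersandRepro {
    // Hypothetical helper: returns the parser's error message,
    // or null if the markup parsed as well-formed XML.
    static String parseError(String markup) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Bare '&' followed by a space: acceptable in HTML, rejected by the XML parser
        System.out.println(parseError("<p>Tom & Jerry</p>"));
        // Properly escaped ampersand: parses without an error
        System.out.println(parseError("<p>Tom &amp; Jerry</p>"));
    }
}
```

So the XPath expression itself is probably not at fault. The document never gets built, because the XML parser stops at the first malformed construct.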
Is there a tool that can read real-world HTML pages more reliably? Or should I just run a regex over the page?