I am trying to use jsoup to find all internal hyperlinks (anchors whose href starts with '#') in a page fetched from a URL. I tried two approaches: DOM methods and CSS selector syntax. The DOM approach returns more than just the internal links. The CSS selector approach returns nothing at all. My source code follows.
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://stackoverflow.com/questions/2793150").get();
    System.out.println("** Using selector syntax **");
    extractUsingSelectorSyntax(doc);
    System.out.println("\n\n");
    System.out.println("** Using DOM methods **");
    extractUsingDOMMethods(doc);
}

public static void extractUsingSelectorSyntax(Document doc) {
    String selectorStr = "a[href^=#*]";
    // Under anchor nodes select the value of the href attribute that starts with
    // the '#' character, followed by 0 or more other characters
    Elements anchors = doc.select(selectorStr);
    for (Element link : anchors) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkText + " | " + linkHref);
    }
}

public static void extractUsingDOMMethods(Document doc) {
    Elements anchors = doc.getElementsByAttributeValueMatching("href", "#*");
    for (Element link : anchors) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkText + " | " + linkHref);
    }
}
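
To narrow down where the two results might come from, here is a minimal sketch that uses only java.util.regex and String.startsWith, not jsoup itself. It rests on two assumptions about jsoup that I have not confirmed against its source: that getElementsByAttributeValueMatching tests the regex with "find" semantics (a match anywhere in the value), and that the ^= selector compares a literal prefix rather than a pattern. The class and method names (HrefPatternDemo, found, hasPrefix) and the sample href values are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

class HrefPatternDemo {

    // Assumed equivalent of a "find"-style regex test:
    // true if the pattern matches anywhere in the value.
    static boolean found(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    // Assumed equivalent of a literal prefix comparison,
    // which is what an attribute selector using ^= expresses.
    static boolean hasPrefix(String prefix, String value) {
        return value.startsWith(prefix);
    }

    public static void main(String[] args) {
        // Hypothetical href values: one internal link, two others.
        String[] hrefs = { "#comments", "/questions/2793150", "https://example.com/" };
        for (String href : hrefs) {
            System.out.println(href
                    + " | regex #* : " + found("#*", href)      // zero or more '#', so an empty match always succeeds
                    + " | regex ^# : " + found("^#", href)      // anchored: value must begin with '#'
                    + " | prefix #* : " + hasPrefix("#*", href) // literal "#*" prefix, which no ordinary href has
                    + " | prefix # : " + hasPrefix("#", href)); // literal "#" prefix
        }
    }
}
```

If those assumptions hold, the regex "#*" would accept every href, which matches the DOM result, and the literal prefix "#*" would accept none, which matches the selector result. Under the same assumptions, "a[href^=#]" in select and "^#" in getElementsByAttributeValueMatching would be the candidates to try next.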