Using JSoup to find all internal hyperlinks

Question

I am trying to use Jsoup to find all internal hyperlinks in a page. I tried two approaches: DOM methods and CSS selector syntax. The DOM approach returns more than just the internal links, and the selector approach returns nothing. My source code follows.

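import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws IOException {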
    Document doc = Jsoup.connect("https://stackoverflow.com/questions/2793150").get();
    
    System.out.println("** Using selector syntax **");
    extractUsingSelectorSyntax(doc);
    
    System.out.println("\n\n");
    
    System.out.println("** Using DOM methods **");
    extractUsingDOMMethods(doc);
}

public static void extractUsingSelectorSyntax(Document doc) {
    String selectorStr = "a[href^=#*]";
    // Under anchor nodes select the value of the href attribute that starts with
    // the '#' character, followed by 0 or more other characters

    Elements anchors = doc.select(selectorStr);

    for (Element link : anchors) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkText + " | " + linkHref);
    }
}

public static void extractUsingDOMMethods(Document doc) {
    Elements anchors = doc.getElementsByAttributeValueMatching("href", "#*");
    for (Element link : anchors) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkText + " | " + linkHref);
    }
}

score 0 · Answer 1 · answered Mar 03 '21 at 16:51

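Your a[href^=#*] selector does not work the way you expect. Jsoup treats the trailing asterisk as a literal asterisk, not as "any character", so the selector only matches href values that begin with the two characters #*.
Omit the asterisk to get some output: a[href^=#].
If you want to use a regex, note that the :matches pseudo-selector applies to an element's text, not to its attributes:

:matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)

To match an attribute value against a regex, use the [attr~=regex] form instead, e.g. a[href~=^#].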
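
The DOM call in the question over-matches for a related reason. As far as I can tell, getElementsByAttributeValueMatching treats its argument as a java.util.regex pattern and looks for a match anywhere in the attribute value, so "#*" (zero or more '#' characters) matches every href, including ones with no '#' at all. Below is a minimal sketch of that behaviour using only java.util.regex. The class name HrefPatternDemo, the helper matchesAnywhere, and the sample href values are made up for illustration, and the "find anywhere" semantics are an assumption about how Jsoup applies the pattern.

```java
import java.util.regex.Pattern;

public class HrefPatternDemo {
    // Hypothetical helper: true if the regex matches anywhere in the value
    // (assumed to mirror how the DOM method applies its pattern).
    public static boolean matchesAnywhere(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        // Sample href values, invented for this sketch.
        String[] hrefs = {"#comments", "/questions/2793150", "https://example.com/"};
        for (String href : hrefs) {
            // "#*" can match the empty string, so it accepts every value;
            // "^#" only accepts values that begin with '#'.
            System.out.println(href
                    + " | #* -> " + matchesAnywhere("#*", href)
                    + " | ^# -> " + matchesAnywhere("^#", href));
        }
    }
}
```

If that assumption holds, passing "^#" as the regex to getElementsByAttributeValueMatching, or calling getElementsByAttributeValueStarting("href", "#"), should restrict the result to links whose href begins with '#'.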
