
I have converted a few web pages into a string, and the string contains these lines (along with other code):

<div class="r"><a href="https://www.apple.com/ca/"
<div class="r"><a href="https://www.facebook.com/ca/"
<div class="r"><a href="https://www.utorrent.com/ca/"

But I just want to extract the link in the first line (https://www.apple.com/ca/) and ignore the rest of the HTML and code. How do I do that?

Arvind Kumar Avinash
Dr cola

2 Answers

The easy way:

String url = input.replaceAll("(?s).*?href=\"(.*?)\".*", "$1");

Key points of why this works:

  • The regex matches the whole input but captures only the target. The replacement is the capture (group #1), so this approach effectively extracts the target
  • (?s) means “dot matches newline”
  • .*? reluctantly (as little input as possible) matches up to “href="”
  • (.*?) captures (reluctantly) everything up to the next “"”
  • .* greedily (as much as possible) matches the rest of the input, including newlines (thanks to (?s) above)
  • The replacement is $1, the first (and only) group in the match
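
As a minimal sketch of the points above, here is the same one-liner wrapped in a runnable class. The class name `FirstHref`, the method name `firstHref`, and the sample input are illustrative assumptions, not part of the question:

```java
public class FirstHref {
    // Returns the value of the first href="..." attribute in the input.
    // The regex consumes the whole input and keeps only capture group 1.
    static String firstHref(String input) {
        return input.replaceAll("(?s).*?href=\"(.*?)\".*", "$1");
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, shaped like the lines in the question.
        String html = "<html>\n"
                + "<div class=\"r\"><a href=\"https://www.apple.com/ca/\"\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\"\n";
        System.out.println(firstHref(html));
    }
}
```

One caveat: if the input contains no href at all, the regex does not match, so replaceAll returns the input unchanged. Check for that case if the input is not guaranteed to contain a link.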
Bohemian

Given below is a solution that uses the Java regex API with the regex mentioned in the answer:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\">Hello</a>";
        String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

https://www.apple.com/ca/
https://www.facebook.com/ca/
https://www.utorrent.com/ca/
Arvind Kumar Avinash
  • Hey sir, this is no good because the code is set up to look for the first visible url in the string. The problem is, I want the first URL after the line `
    – Dr cola Sep 18 '20 at 23:34
  • This doesn’t actually answer the question, which was to get *the first* URL *only*. You could add a `break` to the loop, but it sure is a lot of code. – Bohemian Sep 19 '20 at 01:03
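
The single-match variant suggested in the comment above can be sketched as follows. It assumes the same URL regex as the second answer; the class name `FirstUrl` and the method name `firstUrl` are illustrative. Calling find() once instead of looping returns only the first match, so no `break` is needed:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstUrl {
    // Same URL pattern as in the second answer, compiled once.
    private static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Returns the first URL in the text, or an empty Optional if there is none.
    static Optional<String> firstUrl(String text) {
        Matcher matcher = URL.matcher(text);
        // A single find() stops at the first match.
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>";
        System.out.println(firstUrl(str).orElse("no URL found"));
    }
}
```

Note that this finds the first URL anywhere in the string, not specifically the first one inside an href attribute.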