Detect and extract url from a string?

Question

This is an easy question, but I just don't get it. I want to detect URLs in a string and replace each one with a shortened version.

I found this expression on Stack Overflow, but the result is just http:

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url = m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return str;

Is there any better idea?

score 95 · Answer 1 · answered Apr 19 '11 at 08:53

Let me preface this by saying that I'm not a huge advocate of regex for complex cases. Writing the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and the test cases to handle the issues we've found. It's definitely not trivial:

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}
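
To turn those offsets into strings, here is a minimal sketch. It reuses the pattern above unchanged. The class name UrlOffsets, the extract method and the trailing-punctuation trim are my own assumptions for illustration, not part of the tested code described above. The trim is a blunt workaround for matches that swallow a trailing dot or closing parenthesis, and it would also strip those characters from a URL that legitimately ends with one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlOffsets {
    // Same pattern as in the answer above.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: returns every matched URL as a string.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // start(1) skips the non-word character the pattern consumes before the URL.
            String url = text.substring(matcher.start(1), matcher.end());
            // Assumption: trailing punctuation is not part of the URL, so strip it.
            urls.add(url.replaceAll("[.,;:!?)\\]'\"]+$", ""));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("foo bar http://example.com baz")); // [http://example.com]
    }
}
```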

WhiteFang34

This one unfortunately also matches a dot following the URL. – Christian Brüggemann Sep 03 '15 at 15:31
Doesn't handle URLs in text correctly. Preceding whitespace is incorrectly handled (newlines swallowed), and accepts colons, dots etc after the URL. – Thomas Wana May 16 '16 at 09:02
Doesn't work with something like `google link` It returns `"www.google.com` – 4gus71n Dec 16 '16 at 18:23
doesn't work if url is in parenthesis (www.myurl.com) - returns "www.myurl.com)" – ed22 Sep 17 '17 at 05:41
it doesn't work when string contains `\n`: `Sources:\nhttps://sites.google.com/view/kgssourcesbeauty/startseite\n` is not recognized as a link – Jonathan Morales Vélez Nov 08 '18 at 08:26

score 56 · Answer 2 · answered Feb 01 '15 at 23:17

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

BullyWiiPlaza

Downvoted because there should be eight backslashes, not four. Putting them inside double quotes reduces the number of backslashes to four in the string. The regex interpretation of \\ as a single \ then reduces the number to two, which is what you are trying to match. You can also use non-capturing groups, so `(?://|\\\\)` – Steve Waring May 04 '17 at 18:43
I just made the same mistake, I meant `(?://|\\\\\\\\)` – Steve Waring May 08 '17 at 18:43
Updates in regards to what? – BullyWiiPlaza Oct 30 '18 at 20:50
Big thank you. Your answer is a life-saver for regex newbies like myself. – parsecer Sep 05 '19 at 23:55
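
The backslash counting in the comments above can be checked directly. This is a small sketch, with the class name BackslashDemo and its helper methods made up for illustration: a Java string literal halves the number of backslashes, and the regex engine halves it again.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Four backslashes in source -> two in the string -> regex matches ONE literal backslash.
    static final Pattern ONE = Pattern.compile("\\\\");
    // Eight in source -> four in the string -> regex matches TWO literal backslashes.
    static final Pattern TWO = Pattern.compile("\\\\\\\\");

    static boolean matchesOne(String s) { return ONE.matcher(s).matches(); }
    static boolean matchesTwo(String s) { return TWO.matcher(s).matches(); }

    public static void main(String[] args) {
        String single = "\\";   // one backslash character
        String dbl = "\\\\";    // two backslash characters
        System.out.println(matchesOne(single)); // true
        System.out.println(matchesTwo(dbl));    // true
        System.out.println(matchesTwo(single)); // false
    }
}
```

Note that in the answer's pattern the backslash alternative sits inside a group followed by +, so, as far as I can tell, the single-backslash form repeated twice would still match two backslashes.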

score 9 · Accepted Answer · answered Apr 19 '11 at 08:30

m.group(1) gives you the first matching group, that is, the contents of the first pair of capturing parentheses. Here that is (https?|ftp|file), which is why you only get http.

Check what is in m.group(0), which holds the whole match, or wrap the entire pattern in parentheses and use m.group(1) again.

To get the next URL, call find() again; each successful call updates the groups to the new match.
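
A small sketch of the difference, using the pattern from the question unchanged. The class name GroupDemo, the firstMatch helper and the sample text are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Pattern copied from the question: only the scheme is inside capturing parentheses.
    static final Pattern P = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Returns {whole match, first capturing group} for the first URL, or null if none is found.
    static String[] firstMatch(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = firstMatch("see http://example.com/a?b=1 now");
        System.out.println(r[0]); // http://example.com/a?b=1
        System.out.println(r[1]); // http
    }
}
```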

score 5 · Answer 4 · answered Apr 19 '11 at 08:37

Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http...

I would make this part a non-capturing group using (?:) and put parentheses around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

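For the replacement step the question asks about, one possible sketch uses this pattern with Matcher.appendReplacement, which rewrites each match in place instead of calling str.replace on the whole string. The class name UrlReplacer and the shorten parameter are hypothetical; the question's own shorten(url) is not shown, so it is passed in as a function here.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlReplacer {
    // Pattern from this answer: the scheme is non-capturing, group 1 is the whole URL.
    private static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL in text with whatever the (hypothetical) shorten function returns.
    static String replaceUrls(String text, Function<String, String> shorten) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = shorten.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the short URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a http://example.com/x, b ftp://host/f.";
        System.out.println(replaceUrls(text, url -> "[short]")); // a [short], b [short].
    }
}
```

Because each match is replaced at its own position, a URL that happens to be a prefix of another URL in the same text is not rewritten twice.
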
score 3 · Answer 5 · answered Apr 19 '11 at 08:32

With some extra parentheses around the whole thing (except the word boundary at the start), it should match the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches the whole URL, though.

Billy Moon

This works even for trailing commas and whitespaces.. Great – Abdullah Khan Aug 22 '16 at 08:46

score 0 · Answer 6 · answered Nov 18 '20 at 22:40

Old question, but this library might be useful to someone. It passes lots of test cases:

https://mvnrepository.com/artifact/com.linkedin.urls/url-detector/0.1.17

Additional documentation:
https://engineering.linkedin.com/blog/2016/06/open-sourcing-url-detector--a-java-library-to-detect-and-normali

https://github.com/linkedin/URL-Detector

Alexander Yushko · Answer 7 · 2021-03-05T23:55:50.787

I tried all the examples here on different URLs like these, and none of them works perfectly for all:

http://example.com
https://example.com.ua
www.example.ua
https://stackoverflow.com/question/5713558/detect-and-extract-url-from-a-string
https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8

So I wrote my own regex, and a method using it, that works on text containing multiple links:

private static final String LINK_REGEX = "((http:\\/\\/|https:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,2083}\\.){1,4}([a-zA-Z]){2,6}(\\/(([a-zA-Z-_\\/\\.0-9#:?=&;,]){0,2083})?){0,2083}?[^ \\n]*)";
private static final String TEXT_WITH_LINKS_EXAMPLE = "link1:http://example.com link2: https://example.com.ua link3 www.example.ua\n" +
        "link4- https://stackoverflow.com/questions/5713558/detect-and-extract-url-from-a-string\n" +
        "link5 https://www.google.com/search?q=how+to+extract+link+from+text+java+example&rlz=1C1GCEU_en-GBUA932UA932&oq=how+to+extract+link+from+text+java+example&aqs=chrome..69i57j33i22i29i30.15020j0j7&sourceid=chrome&ie=UTF-8";

And a method that returns an ArrayList of the links:

private ArrayList<String> getAllLinksFromTheText(String text) {
    ArrayList<String> links = new ArrayList<>();
    Pattern p = Pattern.compile(LINK_REGEX, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        links.add(m.group());
    }
    return links;
}

That's all. Call this method with TEXT_WITH_LINKS_EXAMPLE as the argument and you will receive the five links from the text.

score 0 · Answer 8 · answered Mar 24 '21 at 13:55

https://github.com/linkedin/URL-Detector

        <groupId>io.github.url-detector</groupId>
        <artifactId>url-detector</artifactId>
        <version>0.1.23</version>

Yuriy Barannikov

score -1 · Answer 9 · answered Jan 31 '19 at 09:30

This little code snippet / function will extract URL strings from a string in Java. I found the basic regex for doing it here, and used it in a Java function.

I expanded on the basic regex a bit with the part “|www[.]” in order to catch links not starting with “http://”

Enough talk (it is cheap), here’s the code:

// Pull all links from the body for easy retrieval.
// Needs java.util.ArrayList, java.util.regex.Matcher and java.util.regex.Pattern.
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<>();

    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip the parentheses when the whole match is wrapped in them.
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}
