
I am trying to extract valid URLs (as substrings) from a string in ClojureScript. What regular expression can I use?

(re-find #"regex for valid URL" "You can visit www.google.com")
=> "www.google.com"
(re-find #"regex for valid URL" "<b>www.google.com</b>")
=> "www.google.com"
(re-find #"regex for valid URL" "<b>www.google.com</b> and www.yahoo.com")
=> "www.google.com, www.yahoo.com"
Henry Zhu
  • How are you defining a valid URL? `example.com` is valid, as is `a.b.c.d.e.f.example.co.uk`. Will you support unicode characters in the domain name? Do you need support for URL-encoded strings, parameters, and subdomains? – OnlineCop Mar 31 '15 at 00:35
  • This isn't a question about clojure, the jvm, or clojurescript. You're just asking for someone to write a regex for you; and they aren't even sure what flavor of regex to use because you've added multiple conflicting language tags. – amalloy Mar 31 '15 at 00:49
  • @amalloy a regex is exactly what I'm asking for. – Henry Zhu Mar 31 '15 at 01:32
  • @OnlineCop Yeah I wasn't being too clear, and I didn't think about that before hand. I guess I was only thinking of example.com, www.example.com but nothing advanced like encoded strings. a.b.c.d.e.f.example.co.uk seems fine too. I was using this `#"^[a-zA-Z0-9\-\.]+\.(com|org|me|io|net|co|edu|uk|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es)$"` – Henry Zhu Mar 31 '15 at 01:33
  • Well, see http://stackoverflow.com/q/161738/625403 - if what you want is a regex that matches any URL that's legal according to the IETF spec, you are kinda in for something that's a bit longer than your attempt. – amalloy Mar 31 '15 at 02:11
  • goog.string.linkify.findFirstUrl: https://github.com/google/closure-library/blob/32365aba43acb36c5d693256ef5d4dbe3bddddfe/closure/goog/string/linkify_test.js#L334-L357 Given that you want multiple ones, you may have to call the function in a loop and substring until nothing is found anymore. – ClojureMostly Apr 07 '15 at 00:29

1 Answer


Depending on how carefully you want your script to validate the URL, the regex you provided works fairly well once you remove the '^' and '$' anchors (as seen here).

Note that I added some whitespace in the regex just for readability.
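
For reference, here is a minimal sketch of the unanchored pattern at the REPL. It rests on two assumptions that are mine rather than part of the original regex: the name `url-re` is made up for illustration, and the TLD group is written as a non-capturing `(?:...)` group, because with a capturing group `re-find` returns a vector such as `["www.google.com" "com"]` instead of a plain string. `re-seq` is the standard way to get every match rather than only the first.

```clojure
;; The pattern from the comments with the ^ and $ anchors removed.
;; Assumption: the TLD group is non-capturing, (?:...), so that re-find
;; and re-seq return plain strings instead of [match group] vectors.
(def url-re
  #"[a-zA-Z0-9\-\.]+\.(?:com|org|me|io|net|co|edu|uk|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es)")

;; re-find returns only the first match.
(re-find url-re "You can visit www.google.com")
;; => "www.google.com"

(re-find url-re "<b>www.google.com</b>")
;; => "www.google.com"

;; re-seq returns a sequence of all matches.
(re-seq url-re "<b>www.google.com</b> and www.yahoo.com")
;; => ("www.google.com" "www.yahoo.com")
```

If you need the comma-separated string from the question, `(clojure.string/join ", " (re-seq url-re s))` should produce it.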

There are several issues that I see with that regex (as you can probably see on that page). It matches in places where it shouldn't (such as repeated .. characters), and for sites ending in .co.uk it matches the domain name together with the .co portion, and then .uk separately. That, by itself, is fairly easy to fix by simply adding those edge cases directly into the second group (the one with (com|org|...)).
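
A rough sketch of what those two fixes could look like, not a definitive pattern. The name `strict-url-re` is hypothetical, and both changes are my own reading of the fixes described above: the domain is matched as hyphen/alphanumeric labels separated by single dots, so `..` can never be part of a match, and `co\.uk` is listed in the TLD group ahead of `co`.

```clojure
;; Assumed tightening of the original pattern:
;;  - labels are [a-zA-Z0-9-]+ joined by exactly one dot, so ".." cannot match
;;  - co\.uk is added to the TLD group as an explicit edge case
;;  - the group is non-capturing so re-find returns a plain string
(def strict-url-re
  #"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.(?:co\.uk|com|org|me|io|net|co|edu|uk|ca|de|jp|fr|au|us|ru|ch|it|nl|se|no|es)")

(re-find strict-url-re "visit www.example.co.uk today")
;; => "www.example.co.uk"

;; The doubled dot is no longer swallowed into the match.
(re-find strict-url-re "a..b.com")
;; => "b.com"
```

This still accepts things a stricter validator would reject (a label starting with a hyphen, for example), so treat it as a starting point.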

The reason you'll need to remove the '^' and '$' anchors is that, with them, the pattern will only match if the URL is the only thing in the string: without the multiline flag, ^ has to match at the very beginning of the input, and $ can only match at the end. Having <b>www.google.com</b> means that the <b> makes the ^ anchor fail, since the URL no longer starts at the beginning of the input.
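
To make the effect of the anchors concrete, here is a small comparison. The names `anchored` and `unanchored` are made up for this example, and the TLD list is shortened to three entries purely to keep it brief (that shortening, and the non-capturing group, are my assumptions, not part of the original regex).

```clojure
;; Same pattern with and without anchors; TLD list shortened for brevity.
(def anchored   #"^[a-zA-Z0-9\-\.]+\.(?:com|org|net)$")
(def unanchored #"[a-zA-Z0-9\-\.]+\.(?:com|org|net)")

;; The anchored pattern matches only when the URL is the whole string.
(re-find anchored "www.google.com")
;; => "www.google.com"

;; The leading <b> means ^ cannot match where the URL starts.
(re-find anchored "<b>www.google.com</b>")
;; => nil

;; Without anchors the URL is found inside the surrounding markup.
(re-find unanchored "<b>www.google.com</b>")
;; => "www.google.com"
```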

The other suggestions, such as @amalloy's link, give a much more comprehensive solution that will match far more URLs correctly, but they are very complex.

So knowing exactly what you want to match, and what you're willing to ignore/trade/give up, will help craft something that works for you.

OnlineCop