1

I want to create a regex that extracts all links from an input string. It should recognize the following URL formats:

  • http(s)://www.webpage.com
  • http(s)://webpage.com
  • www.webpage.com

It should also handle more complicated URLs such as: http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935

I currently have the following pattern:

((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)

but it does not recognize the following pattern: www.webpage.com. Can someone please help me create an appropriate regex?

EDIT: It should find each link and also place the link at the appropriate index in the text, like this:

private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);

foreach (Match match in (RE_URL.Matches(new_text)))
{
    // Copy raw string from the last position up to the match
    if (match.Index != last_pos)
    {
        var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
        text_block.Inlines.Add(new Run(raw_text));
    }

    // Create a hyperlink for the match
    var link = new Hyperlink(new Run(match.Value))
    {
        NavigateUri = new Uri(match.Value)
    };
    link.Click += OnUrlClick;

    text_block.Inlines.Add(link);

    // Update the last matched position
    last_pos = match.Index + match.Length;
}
niao
  • Possible duplicate: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url – a'r Aug 18 '11 at 13:11

3 Answers

4

I don't know why your match result is only http://, but I cleaned up your regex a bit:

((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)

(?:) are non-capturing groups. That means only one capturing group is left, and it contains the complete matched string.
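As a rough illustration of the point about capturing groups, the sketch below counts the groups in both patterns. The class and method names (GroupCountDemo, GroupCount) are made up for this example; Regex.GetGroupNumbers is the standard .NET call, and its result includes group 0, the whole match.

```csharp
using System.Text.RegularExpressions;

public static class GroupCountDemo
{
    // Pattern from the question: every parenthesis outside the character class captures.
    public const string Original =
        @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";

    // Cleaned-up pattern: only the outermost parenthesis captures.
    public const string Cleaned =
        @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)";

    // Number of groups in a pattern, counting group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    // Value of capturing group 1 for the first match in the input.
    public static string FirstGroup(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

With these definitions, GroupCount(Original) should be 6 and GroupCount(Cleaned) should be 2, and for the cleaned pattern group 1 holds the same text as match.Value.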

(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with one of the schemes from the first list, optionally followed by www., or with www. on its own.

[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (unescaped, \+-= was creating a character range from + to =), and unescaped the . (escaping is not needed inside a character class).
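The character range mentioned above can be checked in isolation. This is a minimal sketch with hypothetical helper names (CharClassRangeDemo, RangeMatches, LiteralMatches); it tests a single character against each form of the class.

```csharp
using System.Text.RegularExpressions;

public static class CharClassRangeDemo
{
    // Unescaped '-' between '+' and '=': a range covering U+002B through U+003D,
    // which includes ',', '-', '.', '/', the digits, ':', ';' and '<'.
    public static bool RangeMatches(string s)
    {
        return Regex.IsMatch(s, @"^[\+-=]$");
    }

    // Escaped '-': only the three literal characters '+', '-' and '='.
    public static bool LiteralMatches(string s)
    {
        return Regex.IsMatch(s, @"^[\+\-=]$");
    }
}
```

RangeMatches(",") is true while LiteralMatches(",") is false, which is why the comma has to be listed explicitly once the - is escaped.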

See this here on Regexr, a useful tool to test regexes.

But URL matching is not a simple task, please see this question here
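To tie the cleaned-up pattern back to the loop in the question, here is a sketch that splits a string into plain-text and link segments without any UI types. LinkSplitter and Split are hypothetical names, and prefixing bare www. matches with http:// is an assumption on my part: new Uri("www.webpage.com") throws a UriFormatException because the string has no scheme, so something like this is needed before setting NavigateUri.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSplitter
{
    private static readonly Regex UrlRegex = new Regex(
        @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)");

    // Returns the input as a sequence of segments in original order.
    // Target is null for plain text and an absolute Uri for links.
    public static List<(string Text, Uri Target)> Split(string input)
    {
        var segments = new List<(string Text, Uri Target)>();
        int lastPos = 0;

        foreach (Match match in UrlRegex.Matches(input))
        {
            // Assumption: a bare "www." match is meant as an http link.
            string address = match.Value.StartsWith("www.", StringComparison.Ordinal)
                ? "http://" + match.Value
                : match.Value;

            // Skip matches that do not form a valid absolute URI; they stay plain text.
            if (!Uri.TryCreate(address, UriKind.Absolute, out Uri target))
                continue;

            // Copy the raw text between the previous link and this one.
            if (match.Index != lastPos)
                segments.Add((input.Substring(lastPos, match.Index - lastPos), null));

            segments.Add((match.Value, target));
            lastPos = match.Index + match.Length;
        }

        // Copy whatever follows the last link.
        if (lastPos < input.Length)
            segments.Add((input.Substring(lastPos), null));

        return segments;
    }
}
```

For "Visit www.webpage.com or https://webpage.com now" this yields five segments, two of them links. Note that the character class contains . and ,, so punctuation directly after a URL at the end of a sentence becomes part of the match.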

stema
  • This is what I'm talking about! Thank you a lot, and thank you for the explanation. – niao Aug 19 '11 at 06:03
3

I've just written up a blog post on recognising URLs in the most commonly used formats, such as:

  • www.google.com
  • http://www.google.com
  • mailto:somebody@google.com
  • somebody@google.com
  • www.url-with-querystring.com/?url=has-querystring

The regular expression used is

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/

However, I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression, in case you need to extend or tweak it.

Matthew O'Riordan
2

The regex you give doesn't work for www. addresses because it expects a URI scheme followed by a colon (the bit before the rest of the URL, like http://). The 'www.' alternative in your regular expression doesn't help because it sits in the same group as the schemes, so it would only match www.:// (which is meaningless).

Try something like this instead:

(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)

This will match something with a valid URI scheme, or something beginning with 'www.'
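One thing to watch with the pattern above is the scope of the | alternation: as written, the character class follows only the www. branch, so a scheme-based URL matches just its prefix. That would be consistent with a match.Value of http:// alone. The sketch below compares the pattern as posted with a regrouped variant; AlternationScopeDemo, Regrouped, and FirstMatch are names invented for this example, and the regrouping is one possible fix rather than the only one.

```csharp
using System.Text.RegularExpressions;

public static class AlternationScopeDemo
{
    // Pattern as posted: the character class belongs to the www. branch only.
    public const string AsPosted =
        @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";

    // Regrouped: the alternation covers only the prefix, and the
    // character class follows whichever prefix matched.
    public const string Regrouped =
        @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+|www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";

    // Text of the first match of the pattern in the input (empty if none).
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

FirstMatch(AsPosted, "http://www.google.pl/#q=regex+url") gives "http://", whereas the regrouped pattern returns the whole URL. Both variants match "www.webpage.com" in full.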

Lee Netherton
  • It doesn't work. I have the following code (I removed some things to keep it simple): foreach (Match match in (RE_URL.Matches(new_text))) { var link = new Hyperlink(new Run(match.Value)) { NavigateUri = new Uri(match.Value) }; } and my match.Value is then only http:// – niao Aug 18 '11 at 13:19
  • @niao why don't you add your code to your question and tell us also your language by adding a language tag? – stema Aug 18 '11 at 13:27
  • @niao I'm not sure why it's not working for you. Are you using a grouping number to extract the output string? (something like 5 probably). This number will be different now. Try incrementing it by 2 (use something like 7). – Lee Netherton Aug 18 '11 at 13:28