2

I'm trying to write a regex that catches any HTTP address. (Background: I'd like to use it in a tkinter window, a simple editor, to turn an HTTP address into a clickable link.) Given how complicated URLs can be, what would be a good regex for this?

alessandro


3 Answers

1

Considering the possibilities that came with Punycode, I'd say this is almost impossible to do with a RegEx.

Of course you could restrict your view to ASCII URLs.
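To illustrate the ASCII restriction: an internationalized host name has an ASCII (Punycode) form that an ASCII-only pattern can still match. The following is a minimal sketch using Python's built-in `idna` codec; the host name `bücher.example` is a made-up example, not taken from the question.

```python
# Hypothetical internationalized host name, used only for illustration.
host = "bücher.example"

# Python's built-in "idna" codec converts each label to its ASCII
# (Punycode, "xn--"-prefixed) form.
ascii_host = host.encode("idna").decode("ascii")

print(ascii_host)  # xn--bcher-kva.example
```

The encoded form contains only letters, digits, hyphens and dots, so a character class such as `[\w-]` would cover it.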

You should take a look at the Regular Expression Library.

primfaktor
1

Using the question "A regex that validates a web address and matches an empty string?" as a basis for an answer.

Assuming that an HTTP (or HTTPS) address:

  • starts with "http://" or "https://"
  • contains at least one "." separating the domain name from the TLD
  • has a domain name composed of letters, digits, _ and -
  • ends at the next whitespace character and may contain any other character before it

then the regular expression could be '(http|https)://[\w-]+(\.[\w-]+)+\S*'

>>> import re
>>> re.sub(r"(http|https)://[\w\-]+(\.[\w\-]+)+\S*", "### URL ###", "There is an URL in this string : https://stackoverflow.com/questions/6532089/regex-to-catch-any-http-address and it is followed by text")
'There is an URL in this string : ### URL ### and it is followed by text'

However, it does not handle punctuation directly after the URL: a trailing period or comma is included in the match.
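One possible way to deal with punctuation after the URL is to trim it from each match afterwards. The sketch below is only an illustration under stated assumptions: the helper name `find_urls` and the set of characters in `TRAILING` are my own choices, not part of the original answer, and a URL that legitimately ends in one of those characters would be cut short.

```python
import re

# Same pattern as above, written as a raw string. The group is
# non-capturing so the whole match is the URL.
URL_RE = re.compile(r"https?://[\w\-]+(?:\.[\w\-]+)+\S*")

# Assumption: these characters at the end of a match are sentence
# punctuation rather than part of the URL.
TRAILING = ".,;:!?)\"'"

def find_urls(text):
    """Return (start, end, url) tuples with trailing punctuation trimmed."""
    found = []
    for m in URL_RE.finditer(text):
        url = m.group(0).rstrip(TRAILING)
        found.append((m.start(), m.start() + len(url), url))
    return found

text = "See https://stackoverflow.com/questions/6532089, then http://example.com."
for start, end, url in find_urls(text):
    print(start, end, url)
```

The character offsets are returned because an editor needs positions rather than replaced text. In a tkinter Text widget they could presumably be turned into indices of the form "1.0+Nc" and passed to `tag_add`, though that part is not tested here.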

Community
Teg
  • Your answer will match, but completely different than you expect. `/*.*` means match any amount of `/` and then any amount of any character. `.` is a special character in regex and means ANY character. Your regex will match e.g. `http:/` – stema Jun 30 '11 at 08:55
  • Your expression matches input like `http:/`, `http:/not-a-url` and `http:///////////////`. It would also catch all whitespace, so as soon as a URL is typed in the OP's editor window, it would never end! – anton.burger Jun 30 '11 at 08:57
  • For the intended effect (from Teg's prose I take it that he was trying to specify a glob instead of a regex), try "http://[^/.]+\.[^/.]+" - not that I recommend this as a way of recognizing links – Sasha Jun 30 '11 at 09:08
  • Welcome to StackOverflow. Answers here are reviewed very quickly, but you can edit and improve your answer. If you change it into something that is not wrong, I can take back my downvote; if it's good, I will give you an upvote. Don't be discouraged: SO is a great place to learn, and you already got hints in the comments on your answer. – stema Jun 30 '11 at 09:25
  • I believe I do answer now. The regular expression doesn't verify a valid URL but detects what should be interpreted as such in a string. It still can be improved by detecting punctuation. – Teg Jun 30 '11 at 12:03
1

The tornado.escape module has a nice function, "linkify", for this. You can view the source here: escape.py. PS: I wanted to post this as a comment, but I don't have enough privileges. Anyway, I hope you find it useful.

timgluz