
I have this code for extracting URLs from text in Python 3:

import re

myString = "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum lorem.org/132141222 "

print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))

The output I am getting is:

https://www.lorem.com/ipsum.php?q=suas,

However, I cannot extract the second URL: lorem.org/132141222

What is the best regular expression to extract all forms of URLs in Python? I want to extract even incomplete and shortened URLs (without 'https' and 'www').

goodboy
  • The more permissive your expression, the greater chance of mismatches. Should it match `test.fit/data2`? How about plain `test.fit`? – Jongware Jan 29 '20 at 15:43
  • Yes, it should extract the URL as it is: test.fit/data2 – goodboy Jan 29 '20 at 15:49
  • URL validation is pretty tricky with regex. I would recommend finding a good URL validation regex, then anchoring the beginning and end to a word boundary with `\b`, like the following: `\b(?:https?:\/\/)?(?:\w+\.)++(?:\w++)(\/\w+)*(?(1)\.\w+)?(?:\?[\w=%]+)?\b`. Note that while the regex provided does match your test cases, it is a VERY BAD URL validation regex; I merely wanted to present an example that partially works. Here is a better answer not using regex: https://stackoverflow.com/a/44645124/12689629 And here is a better URL validation regex: https://stackoverflow.com/a/190405/12689629 – Zaelin Goodman Jan 29 '20 at 16:14
  • You can also try `urlextract`. https://github.com/lipoja/URLExtract – Jan Lipovský Jan 08 '21 at 08:33
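
For illustration, here is a minimal sketch of the permissive approach discussed in the comments. Note that `re.search` returns only the first match, so the sketch uses `finditer` to collect every match instead. The pattern `URL_PATTERN` and the helper `extract_urls` are hypothetical names, and the pattern is an assumption rather than a validated URL grammar: it makes the scheme optional and accepts any dotted name ending in at least two letters. Trailing punctuation is stripped heuristically.

```python
import re

# Assumed, deliberately permissive pattern (a sketch, not a URL validator).
URL_PATTERN = re.compile(
    r"""
    (?:https?://)?          # optional scheme
    (?:[A-Za-z0-9-]+\.)+    # one or more dot-terminated labels
    [A-Za-z]{2,}            # final label of at least two letters
    (?:/[^\s]*)?            # optional path and query, up to whitespace
    """,
    re.VERBOSE,
)


def extract_urls(text):
    """Return every URL-like substring of text, in order of appearance."""
    # rstrip is a heuristic: it drops sentence punctuation that the
    # non-whitespace path match swallowed (for example the comma after "suas").
    return [m.group(0).rstrip(".,;:!?)") for m in URL_PATTERN.finditer(text)]


myString = (
    "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, "
    "pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy "
    "comprehensam id mei. Ei eum maiestatis quaerendum lorem.org/132141222 "
)

print(extract_urls(myString))
```

As Jongware's comment warns, the permissiveness cuts both ways: this pattern accepts `test.fit/data2` and plain `test.fit`, but it will also pick up file names such as `notes.txt` that happen to look like domains.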
