
I wrote a regular expression in Python that searches for GitHub page links:

github = re.findall(
    r"https?://(?:www\.)?github\.com/[A-Za-z0-9_-]+/?",
    text)

But it currently only finds links that start with http:// or https://. How can it be modified so that the regex matches strings that start either with https or with just www?

Now my regex will find this:

https://github.com/helloman

as well as this:

https://www.github.com/helloman

but not this:

www.github.com/helloman

How can it be changed to accept all three options?

heisenberg7584
  • The question isn't clear for me. Can you post some example URLs? – accdias Nov 23 '19 at 14:38
  • Edited, hope it's better now – heisenberg7584 Nov 23 '19 at 14:42
  • I tested your regex with all three examples and it already does what you want. Can't see what is wrong. Can you clarify? – accdias Nov 23 '19 at 14:43
  • I am pretty sure it doesn't work for an address like `www.github.com/XXX` – heisenberg7584 Nov 23 '19 at 14:44
  • So, you want to find URLs that start either with `www.` _or_ `https?://(?:www\.)?`. You can use the OR syntax to do that: `(thing)|(another thing)`. Or gather _all URLs_ with [this](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) and then use a URL parser (I think one is provided by `urllib`) to check the domain – ForceBru Nov 23 '19 at 14:47
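
The second suggestion in the last comment (collect URL-like strings first, then check the domain with a URL parser) could look roughly like the sketch below. The `CANDIDATE` pattern and the `github_links` helper are illustrative assumptions, not something from the thread, and the candidate pattern is deliberately loose rather than a complete URL grammar.

```python
import re
from urllib.parse import urlparse

# Loose candidate pattern (an assumption for this sketch): an optional
# http/https scheme, a dotted host name, a slash, then non-whitespace.
CANDIDATE = re.compile(r"(?:https?://)?[\w.-]+\.[A-Za-z]{2,}/\S*")

def github_links(text):
    """Return the candidates whose host is github.com or www.github.com."""
    found = []
    for candidate in CANDIDATE.findall(text):
        # urlparse only fills in the host when a scheme or a leading "//"
        # is present, so prefix scheme-less candidates with "//".
        parsed = urlparse(candidate if "://" in candidate else "//" + candidate)
        if parsed.hostname in ("github.com", "www.github.com"):
            found.append(candidate)
    return found

sample = ("see https://github.com/helloman and https://www.github.com/helloman "
          "or www.github.com/helloman but not https://example.com/helloman")
links = github_links(sample)
print(links)
```

Because the candidate pattern ends in `\S*`, trailing punctuation such as a full stop would be captured too, so real text may need extra trimming.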

2 Answers

This will do the job:

(?:https?://)?(?:www[.])?github[.]com/[\w-]+/?

And here is a proof of concept:

Python 3.7.5 (default, Oct 17 2019, 12:16:48) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> github = re.compile(r'(?:https?://)?(?:www[.])?github[.]com/[\w-]+/?')
>>> github.findall('www.github.com/accdias/dotfiles.git')
['www.github.com/accdias/']
>>> github.findall('github.com/accdias/dotfiles.git')
['github.com/accdias/']
>>> github.findall('https://github.com/accdias/dotfiles.git')
['https://github.com/accdias/']
>>> github.findall('http://github.com/accdias/dotfiles.git')
['http://github.com/accdias/']
>>> github.findall('http://www.github.com/accdias/dotfiles.git')
['http://www.github.com/accdias/']
>>> github.findall('https://www.github.com/accdias/dotfiles.git')
['https://www.github.com/accdias/']
>>> 

I hope it helps.
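
As a quick check against the three URLs from the question, here is a minimal sketch that runs the same pattern over them in one pass. The pattern is written as a raw string, since a `\w` inside a normal string literal is an invalid string escape that recent Python versions warn about. The names `GITHUB`, `text` and `matches` are placeholders chosen for this example.

```python
import re

# Same pattern as in the answer, written as a raw string so that \w
# reaches the regex engine unchanged.
GITHUB = re.compile(r"(?:https?://)?(?:www[.])?github[.]com/[\w-]+/?")

# The three forms the question asks about, one per line.
text = """
https://github.com/helloman
https://www.github.com/helloman
www.github.com/helloman
"""

# No capturing groups are used, so findall returns the full matches.
matches = GITHUB.findall(text)
print(matches)
```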

accdias
  • `github.findall('www.github.com/accdias/dotfiles.git') == []`, but the OP wants a regex that _does accept_ this URL – ForceBru Nov 23 '19 at 14:48
  • Oh! Now I see. Thanks for clarifying. I was under the impression OP wanted to *exclude* those without the protocol. – accdias Nov 23 '19 at 14:51
  • IMHO `//` is clearer than `/{2}` and you've missed the hyphen, OP said `[A-Za-z0-9_-]+` that is `[\w-]+`, not `\w+` alone. – Toto Nov 23 '19 at 15:03
  • @Toto, indeed. I will update the answer. Thanks for bringing that up. – accdias Nov 23 '19 at 15:05

You are only missing a couple of brackets.

https://regex101.com/r/NEuD5f/2

(https?:\/\/)?(www\.)?github\.com\/[A-Za-z0-9_-]+\/?

P.S.

It will now match github.com/xxx too. I'm not sure that is what you want.
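
If bare `github.com/xxx` should not match, one possible variation is to use alternation so that either a scheme or `www.` is required. The sketch below also shows a side effect of capturing parentheses: `re.findall` returns the group contents instead of the whole match, so non-capturing groups or `finditer` may be preferable. The names `PREFIXED`, `CAPTURING` and `samples` are made up for this illustration, and `CAPTURING` is only a pattern in the style of the one above.

```python
import re

# Requires a prefix: a scheme (optionally followed by www.) or www. alone.
PREFIXED = re.compile(r"(?:https?://(?:www\.)?|www\.)github\.com/[A-Za-z0-9_-]+/?")

# Optional prefixes written with capturing parentheses.
CAPTURING = re.compile(r"(https?://)?(www\.)?github\.com/[A-Za-z0-9_-]+/?")

samples = "https://github.com/a www.github.com/b github.com/c"

# Only the two prefixed links are returned.
prefixed_hits = PREFIXED.findall(samples)

# With capturing groups, findall yields one tuple of group contents per match.
group_tuples = CAPTURING.findall(samples)

# finditer with group(0) recovers the full matched text.
full_hits = [m.group(0) for m in CAPTURING.finditer(samples)]

print(prefixed_hits)
print(group_tuples)
print(full_hits)
```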

abhilb