
My string looks like this: "https://google.com/bar/foobar?count=1" or it could be "https://google.com/bar/foobar"

I want to extract the value foobar. It appears after /bar/ and may be followed by an optional query string that starts with ?

My regex looks like this: m = re.match(r'(.*)/bar/(.*)((\?)(.*))?', data)

When I use this regex on the second example, "https://google.com/bar/foobar", I get these groups: ('https://google.com', 'foobar', None, None, None)

When I use this regex on the first example: "https://google.com/bar/foobar?count=1" I get

('https://google.com', 'foobar?count=1', None, None, None)

But I would like the second group to be just foobar, without the ?count=1. How would I achieve that?

My understanding so far is

(.*)/bar/(.*)((\?)(.*))? works as follows: the first (.*) matches the first part of the string, \? matches the literal ?, and ((\?)(.*)) matches ?count=1. That group is followed by ? because it is supposed to be optional.

suprita shankar
  • `((\?)(.*))?` is optional, so the second `(.*)` grabs the rest of the string after the last `/bar/` in the string. – Wiktor Stribiżew Feb 05 '20 at 22:46
  • I would strongly suggest that you use an existing Python library to parse your URLs. That library will have been written, tested and debugged over many years, and will undoubtedly cover corner cases that you might not have considered. – Andy Lester Feb 05 '20 at 23:12
  • Thanks for the suggestion. I am trying to understand how regexes work. This is not used in production. – suprita shankar Feb 05 '20 at 23:37
  • @WiktorStribiżew - Yeah, the reason it is optional is that `?count=1` can be absent. – suprita shankar Feb 05 '20 at 23:40

2 Answers


The .* in your regular expression is greedy: it matches as much as it can. The first .* stops at the last /bar/ only because /bar/ is required, but the second .* matches through to the end of the URL, since everything after it is optional. To avoid this, make the quantifier non-greedy by adding a ? after the *.

You also need to anchor the pattern with a $ at the end. Otherwise the non-greedy groups match as little as possible, which is the empty string.
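To see why the anchor matters, here is a small sketch (the variable name unanchored is only illustrative) that runs the non-greedy pattern without the trailing $. Nothing forces the lazy quantifier to grow, so the second group comes back empty and the optional group is skipped.

```python
import re

data = "https://google.com/bar/foobar?count=1"

# Non-greedy quantifiers but no end anchor: the lazy (.*?) after /bar/
# is satisfied with zero characters, and the optional group is skipped.
unanchored = re.match(r'(.*)/bar/(.*?)((\?)(.*?))?', data)
print(unanchored.groups())
# ('https://google.com', '', None, None, None)
```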

>>> import re
>>> data = "https://google.com/bar/foobar?count=1"
>>> re.match(r'(.*)/bar/(.*?)((\?)(.*?))?$', data).groups()
('https://google.com', 'foobar', '?count=1', '?', 'count=1')
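A possible alternative, not part of the pattern above, is to replace the lazy quantifier with a negated character class. [^?]* can never run past a ?, so no $ anchor is needed. The helper name extract_after_bar is hypothetical and only there for illustration.

```python
import re

# [^?]* matches any run of characters that are not '?', so the capture
# stops at the query string (or at the end of the URL if there is none).
AFTER_BAR = re.compile(r'.*/bar/([^?]*)')


def extract_after_bar(url):
    """Return the text after the last '/bar/' and before any '?', or None."""
    m = AFTER_BAR.match(url)
    return m.group(1) if m else None


print(extract_after_bar("https://google.com/bar/foobar"))          # foobar
print(extract_after_bar("https://google.com/bar/foobar?count=1"))  # foobar
```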
Prem Anand

Use a URL parser to extract the path component first. The query string is then already gone, so you can simplify your regex to .*/bar/(.*)

import re
import urllib.parse

examples = [
    "https://google.com/bar/foobar",
    "https://google.com/bar/foobar?count=1",
    ]

for ex in examples:
    path = urllib.parse.urlparse(ex).path
    result = re.search(r'.*/bar/(.*)', path)
    print(result.group(1))

Output:

foobar
foobar
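The simplified regex works because urlparse has already separated the query string from the path. A quick sketch of what the parser returns for the first example (the variable names are illustrative):

```python
from urllib.parse import urlparse, parse_qs

parts = urlparse("https://google.com/bar/foobar?count=1")

# The path and the query string come back as separate fields,
# so the regex never sees the '?count=1' part.
print(parts.path)             # /bar/foobar
print(parts.query)            # count=1

# parse_qs turns the query string into a dict of lists.
params = parse_qs(parts.query)
print(params)                 # {'count': ['1']}
```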
wjandrea