Python: Extracting URLs using regex or other means

Question

I’m stumped on a problem. I have a large data frame where two of the columns are like this:

pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],          ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

What I’m trying to do is leave only the URL including the word “twitter” left in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and ends with “/” + a one-digit number. In the cases where there are two identical URLs in the same cell then only one should remain. Like this:

Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
 ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
 ['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

Test2

I’m new to Python and after a lot of googling I’ve started to understand that something called regex is the answer but that is as far as I come. One of the postings here at Stackoverflow led me to regex101.com and after playing around this is as far as I’ve come and it doesn't work:

r’^[https]+(:)(//)(.*?)(/)(\d)’

Can anyone tell me how to solve this problem? Thanks in advance.

Paolo · Accepted Answer · 2018-09-02T15:49:02.223

Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.

Your current pattern is incorrect because:

^ Matches the following pattern at the start of string.
[https]+ This is a character set, meaning it will match h, s, ps, therefore any combination of one or more letters present in the [] brackets, and not just the strings http and https which is what you are after.
(:) You don't need to put this : in a capturing group here.
(//) / Needs to be escaped in regex, \/. No need for capturing group here either.
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.

You may use the following expression:

https?:\/\/twitter\.com[^,]+(?<=\/\d$)

https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.

Regex demo here.

Python demo:

import pandas as pd

df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
                   ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
                   ['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)

Prints:

  ID                                                             URLs
0  a  https://twitter.com/dog_rates/status/890971913173991426/photo/1
1  b  https://twitter.com/dog_rates/status/890971913173991426/photo/1
2  c  https://twitter.com/dog_rates/status/890971913173991430/video/1

Python: Extracting URLs using regex or other means

1 Answers1