1

I’m stumped on a problem. I have a large data frame where two of the columns are like this:

pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],          ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

What I’m trying to do is leave only the URL including the word “twitter” left in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and ends with “/” + a one-digit number. In the cases where there are two identical URLs in the same cell then only one should remain. Like this:

Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
 ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
 ['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

Test2

I’m new to Python and after a lot of googling I’ve started to understand that something called regex is the answer but that is as far as I come. One of the postings here at Stackoverflow led me to regex101.com and after playing around this is as far as I’ve come and it doesn't work:

r’^[https]+(:)(//)(.*?)(/)(\d)’

Can anyone tell me how to solve this problem? Thanks in advance.

Paolo
  • 10,935
  • 6
  • 23
  • 45
Economist
  • 97
  • 6

1 Answers1

3

Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.

Your current pattern is incorrect because:

  • ^ Matches the following pattern at the start of string.
  • [https]+ This is a character set, meaning it will match h, s, ps, therefore any combination of one or more letters present in the [] brackets, and not just the strings http and https which is what you are after.
  • (:) You don't need to put this : in a capturing group here.
  • (//) / Needs to be escaped in regex, \/. No need for capturing group here either.
  • (.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
  • (/) As discussed above.
  • (\d) Matches and captures a digit. The capturing group here is also redundant for your task.

You may use the following expression:

https?:\/\/twitter\.com[^,]+(?<=\/\d$)
  • https? Matches literal substrings http or https.
  • :\/\/twitter\.com Matches literal substring ://twitter.com.
  • [^,]+ Anything that is not a comma, one or more.
  • (?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.

Regex demo here.


Python demo:

import pandas as pd

df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
                   ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
                   ['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])

df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)

Prints:

  ID                                                             URLs
0  a  https://twitter.com/dog_rates/status/890971913173991426/photo/1
1  b  https://twitter.com/dog_rates/status/890971913173991426/photo/1
2  c  https://twitter.com/dog_rates/status/890971913173991430/video/1
Paolo
  • 10,935
  • 6
  • 23
  • 45