How to use RegEx to get part of redirect url?

Question

I have a column containing redirect URLs from Google Custom Search results. I would like to extract the external domain from each combined URL.

Example:

https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q
https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3
https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV

From the URLs above, I would like to extract the external domain name.

Results from above examples:

I am able to do this in Google Sheets, but I need a RegEx so that I can use it in Google Data Studio.

Thanks.

I am able to get the entire URL of the external domain with the following regex: (?<=\&q=)(.*?)(?=\&) https://regex101.com/r/0odQR7/1 Now I am looking for a way to get just the TLD. – p.in4matics, Feb 13 '21 at 05:13
Update: with the following regex I get just the domain names, but it matches both www.google.com and www.site2.co.uk. I want only the second one. How can I do that? (?<=//)(.*?)(?=/) https://regex101.com/r/kbO4Wb/1 – p.in4matics, Feb 13 '21 at 05:22
Please [update your question](https://stackoverflow.com/posts/66181929/edit) instead of putting your attempts in comments. – Toto, Feb 13 '21 at 09:33
That generic reference guide is not really a dupe of the problem that OP is facing in `Google Data Studio` – anubhava, Feb 19 '21 at 10:21

score 2 · Answer 1 · answered Feb 13 '21 at 09:38

Just combine both regexes:

(?:(?<=&q=https://)|(?<=&q=http://))(.*?)(?=/.*?&)
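As a rough check of the combined pattern, here is a minimal sketch that runs it against the sample URLs from the question. Python's `re` module is only an assumption for illustration (it accepts these fixed-width lookbehinds), and `extract_domain` is a hypothetical helper name, not something from the original post.

```python
import re
from typing import Optional

# Pattern copied from the answer; the two lookbehinds cover the https and http cases.
COMBINED = re.compile(r"(?:(?<=&q=https://)|(?<=&q=http://))(.*?)(?=/.*?&)")


def extract_domain(redirect_url: str) -> Optional[str]:
    """Return the host that follows the q= parameter, or None if nothing matches."""
    match = COMBINED.search(redirect_url)
    return match.group(1) if match else None


# Sample redirect URLs taken from the question.
SAMPLE_URLS = [
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV",
]

for url in SAMPLE_URLS:
    print(extract_domain(url))
```

Note that the lookahead `(?=/.*?&)` requires a `/` after the host, so a target URL with no path (for example `q=https://example.com&sa=U`) would not match with this pattern.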

Toto

anubhava · Accepted Answer · 2021-02-16T15:18:08.540

1

You may use this regex with an additional negative lookbehind:

(?<=(?<!^https)://)[^/]+

RegEx Details:

(?<=(?<!^https)://): Positive lookbehind asserting that :// comes immediately before the current position. The nested negative lookbehind (?<!^https) additionally asserts that this :// is not preceded by https at the start of the line, so the leading Google URL is skipped.
[^/]+: Match 1+ of any character that is not /
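To see the nested lookbehind in action, the following is a small sketch using Python's `re` module, chosen here only because it supports lookbehind; the helper name `external_host` is hypothetical. The whole match (group 0) is the host, since this pattern has no capture group.

```python
import re
from typing import Optional

# Pattern from the answer: "://" must precede the match,
# but not as part of an "https://" that starts the line.
SKIP_LEADING = re.compile(r"(?<=(?<!^https)://)[^/]+")


def external_host(redirect_url: str) -> Optional[str]:
    """Return the first host that is not the line-initial one, or None."""
    match = SKIP_LEADING.search(redirect_url)
    return match.group(0) if match else None


# Two of the sample URLs from the question (one https target, one http target).
url_https = "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q"
url_http = "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3"

print(external_host(url_https))
print(external_host(url_http))
# A URL with only the leading https:// yields no match at all.
print(external_host("https://www.google.com/search"))
```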

Update: As per the comments below, lookbehind is not supported in Google Data Studio, so we can use this regex instead:

.https?://([^/]+)

And grab the domain name from capture group #1.

The . placed before https? requires one character in front of the scheme, which ensures that we don't match a URL at the start of a line.
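The lookbehind-free pattern can be tried out with the sketch below. Python's `re` is not RE2, so this is only an approximation for illustration; the pattern uses no constructs beyond a character class, an optional character, and a capture group. `first_group` is a hypothetical helper, and the second pattern is the `[^&]+` variant mentioned in the comments, which captures everything after the scheme up to the next `&`.

```python
import re
from typing import Optional

# Lookbehind-free pattern from the update; the host ends up in capture group 1.
DOMAIN_ONLY = re.compile(r".https?://([^/]+)")
# Variant from the comments: capture up to the next "&" instead of the next "/".
UP_TO_AMPERSAND = re.compile(r".https?://([^&]+)")


def first_group(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return capture group 1 of the first match, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# First sample URL from the question.
url = "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q"

print(first_group(DOMAIN_ONLY, url))
print(first_group(UP_TO_AMPERSAND, url))
# Nothing precedes the scheme here, so the leading "." cannot match.
print(first_group(DOMAIN_ONLY, "https://www.google.com/"))
```

In Data Studio the same pattern would presumably be passed to an extraction function such as `REGEXP_EXTRACT`; that step is not verified here.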

edited Feb 16 '21 at 15:18

answered Feb 13 '21 at 10:11

anubhava

Thanks, it worked. Unfortunately, I realized that to work in Google Data Studio the regex has to be RE2-compatible. If you can help me with that, it will save my day. Thanks. – p.in4matics Feb 14 '21 at 05:42
Sorry I am not a `Google Data Studio` user. Does it support lookbehind as you have used in your attempts? – anubhava Feb 14 '21 at 05:59
No, it is not supported. – p.in4matics Feb 14 '21 at 06:03
Hmm but your regex was also using it – anubhava Feb 14 '21 at 06:04
Actually, this is the first time I am using regex. I was unaware that Google only supports RE2. – p.in4matics Feb 14 '21 at 06:07
ok try this regex `.https?://([^/]+)` and grab capture group #1 – anubhava Feb 14 '21 at 06:19
Perfect. Thank you so much. Now, when I wanted to get the entire URL and not just the TLD, I edited your regex to `.https?://([^&]+)`. Thanks once again. – p.in4matics Feb 15 '21 at 03:17
