
I have a column containing a list of redirect URLs from Google Custom Search results. I would like to extract the external domain from each combined URL.

Example:

  1. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q

  2. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3

  3. https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV

From the URLs above, I would like to extract the external domain name.

Results from above examples:

  1. examplesite1.co.uk
  2. www.exmaplesite2.co.uk
  3. examplesite-3.com

I am able to do this in Google Sheets, but I need a RegEx so that I can use it in Google Data Studio.

Thanks.

horcrux
p.in4matics
  • I am able to get the entire URL of the external domains with the following regex: (?<=\&q=)(.*?)(?=\&) https://regex101.com/r/0odQR7/1 Now I am looking for how to get the TLD. – p.in4matics Feb 13 '21 at 05:13
  • Update: with the following regex I get just the domain names, but it matches both www.google.com and www.site2.co.uk. I want only the second one. How can I do that? (?<=//)(.*?)(?=/) https://regex101.com/r/kbO4Wb/1 – p.in4matics Feb 13 '21 at 05:22
  • Please [update your question](https://stackoverflow.com/posts/66181929/edit) instead of putting your attempts in comments. – Toto Feb 13 '21 at 09:33
  • That generic reference guide is not really a dupe of this problem that OP is facing in `Google Data Studio` – anubhava Feb 19 '21 at 10:21

2 Answers

Just combine both regexes:

(?:(?<=&q=https://)|(?<=&q=http://))(.*?)(?=/.*?&)

Demo & explanation
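
As a quick check outside Data Studio, the combined pattern can be run against the three sample URLs from the question. This is a minimal sketch in Python, assuming Python's `re` module as a stand-in engine; the `extract_domain` helper name is hypothetical. The pattern relies on lookbehind, which not every regex engine supports.

```python
import re

# Combined pattern from the answer above: the domain must directly follow
# "&q=https://" or "&q=http://" and be followed by a "/" and, later, an "&".
PATTERN = re.compile(r"(?:(?<=&q=https://)|(?<=&q=http://))(.*?)(?=/.*?&)")


def extract_domain(url):
    """Return the external domain embedded in a redirect URL, or None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


# Sample URLs copied from the question.
urls = [
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV",
]

for url in urls:
    print(extract_domain(url))
```

Note that the lookahead requires a `/` after the domain, so a redirect whose target has no path (for example `&q=https://example.com&sa=U`) would not match and the helper would return `None`.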

Toto

You may use this regex with an additional negative lookbehind:

(?<=(?<!^https)://)[^/]+

RegEx Demo

RegEx Details:

  • (?<=(?<!^https)://): Positive lookbehind asserting that :// immediately precedes the current position. The nested negative lookbehind (?<!^https) additionally asserts that this :// is not preceded by https at the start of the string, which skips the leading Google URL.
  • [^/]+: Match 1+ of any character that is not /
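
The behaviour described above can be checked with a short script. This is a minimal sketch, assuming Python's `re` module as the engine (it accepts this nested, fixed-width lookbehind); the helper name `extract_with_lookbehind` is hypothetical.

```python
import re

# Lookbehind pattern from the answer: match the host after "://",
# unless that "://" belongs to an "https" at the very start of the string.
PATTERN = re.compile(r"(?<=(?<!^https)://)[^/]+")


def extract_with_lookbehind(url):
    """Return the first host that follows a non-leading '://', or None."""
    match = PATTERN.search(url)
    return match.group(0) if match else None


# Two of the sample URLs from the question.
sample_1 = "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q"
sample_2 = "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3"

print(extract_with_lookbehind(sample_1))
print(extract_with_lookbehind(sample_2))
```

The whole match is the domain here, so no capture group is needed; the leading `www.google.com` is skipped because its `://` is preceded by `https` at the start of the string.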

Update: As per the comments below, lookbehind is not supported in Google Data Studio, so we can use this regex instead:

.https?://([^/]+)

And grab domain name from capture group #1.

The . placed before https? requires at least one character in front of the scheme, which ensures that we don't match a URL at the start of a line.
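
To see the capture-group variant in action, the pattern can be tried on the sample URLs from the question. This is a minimal sketch using Python's `re` module as a stand-in, on the assumption that the Data Studio extraction function returns the first capture group in the same way; the helper name `extract_capture_group` is hypothetical.

```python
import re

# Lookbehind-free pattern: any one character, then the scheme,
# then the host captured in group 1 (everything up to the next "/").
PATTERN = re.compile(r".https?://([^/]+)")


def extract_capture_group(url):
    """Return capture group 1 (the external domain), or None if no match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


# Sample URLs copied from the question.
urls = [
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite1.co.uk/aa-vv--cc-dd-gggg-/&sa=U&ved=2ahUKEwjj1cvJ79PuAhXBHc0KHRgvBLsgQIAhAC&usg=AOvVaw2vIHUiy31YKWs5c41Q",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=http://www.exmaplesite2.co.uk/wp-content/uploads/2016/12/research-paper.pdf&sa=U&ved=2ahUKEwiphLKMi80KHcLUCMAQFjAFegQIARAC&usg=AOvVawkm-bXjmxsPxLQ9w3",
    "https://www.google.com/url?client=internal-element-cse&cx=3c360356&q=https://examplesite-3.com/home/en/aaa-bbb/38376&sa=U&ved=2ahUKEwixq4K7qttXEKHTOEClsQFjAAegQIARAB&usg=AOvVaw2ouHhfNNTPV",
]

for url in urls:
    print(extract_capture_group(url))
```

The leading Google URL is skipped because no character precedes its `https`, so the first possible match starts at the `=` in `q=`.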

anubhava