4

When using BigQuery query with Data containing URLs we noticed that the DOMAIN function behaves differently from the case of the URL.

This can be demonstrated with this simple query:

SELECT
    domain('WWW.FOO.COM.AU'),
    domain(LOWER('http://WWW.FOO.COM.AU/')),
    domain('http://WWW.FOO.COM.AU/')

The result of URL of full uppercase does not seem to be right and the documentation does not mentioned anything regarding case in URLs.

Query Result

Paul Liang
  • 638
  • 6
  • 15
  • i feel like this is a bug while detecting original top-level domains like .com .net .org etc. - looks like they are lower-cased. the rest doesn't matter and is carried as is. thus same issue with TLD() – Mikhail Berlyant Jul 28 '16 at 01:04

1 Answers1

1

DOMAIN (and the other URL-handling functions in legacy SQL) have a number of limitations, unfortunately. While we don't have an equivalent yet in standard SQL (uncheck the "Use Legacy SQL" box under Options), you can make up your own that works in more cases using a regular expression. There are a number of StackOverflow questions about domain extraction, and we can put one of the answers to use as:

CREATE TEMPORARY FUNCTION GetDomain(url STRING) AS (
  REGEXP_EXTRACT(url, r'^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)'));

WITH T AS (
  SELECT url
  FROM UNNEST(['WWW.FOO.COM.AU:8080', 'google.com',
               'www.abc.xyz', 'http://example.com']) AS url)
SELECT
  url,
  GetDomain(url) AS domain
FROM T;

+---------------------+----------------+
|         url         |     domain     |
+---------------------+----------------+
| www.abc.xyz         | abc.xyz        |
| WWW.FOO.COM.AU:8080 | WWW.FOO.COM.AU |
| google.com          | google.com     |
| http://example.com  | example.com    |
+---------------------+----------------+
Community
  • 1
  • 1
Elliott Brossard
  • 26,036
  • 2
  • 41
  • 74
  • Thanks for the answer. Parsing URL in regex is fine, and most of the time I could just get away with it by lower-casing the URL before applying the function. This's not a deal breaker anyway,it's just I hope that this could be documented somewhere in the manual. – Paul Liang Jul 31 '16 at 07:59
  • That's a good suggestion; I filed https://code.google.com/p/google-bigquery/issues/detail?id=638. Thanks! – Elliott Brossard Aug 01 '16 at 15:08