How do you find 2 groups in a URL using regex?

Question

I am not a programmer by trade, but I need to create a rule in Google Tag Manager using regex. My goal is to look at a URL and find two separate group matches in it. Here is a sample URL:

http://123.website.com/?&guid=blahblahblah&page=something&type=abc&adv=abc1234&site={siteID}

I originally had this regex below which worked great if it weren't for the "&guid=blahblahblah&page=something&" in between the two groups. How do I check for those two groups in one expression?

(http:\/\/)(([0-9])|([0-9][0-9])|([0-9][0-9][0-9]))\.website\.com\?(type\=abc)

Bonus: How can I make it check for https as well as http?

thx!

Sure you want to do all that in a single regex? Which language do you use, maybe it has a more convenient feature to handle urls? — arkascha, Apr 18 '14 at 18:01
Your question isn't very clear... Can you edit it to explain what exactly you want to extract from the URL (with examples of input/output)? — Robin, Apr 18 '14 at 18:06
To match `http` or `https`, use `https?` which will make the `s` optional. — Sam, Apr 18 '14 at 18:17
This answer from the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496), listed under "Common Validation Tasks", may be of interest: [Using regex to validate a url](http://stackoverflow.com/a/190405/2736496). — aliteralmind, Apr 18 '14 at 19:11

Pedro Lobito · Answer 1 · 2014-04-18T20:05:44.513

0

It's actually easier than you think:

/https?:\/\/([\d]{1,4})\.website\.com\/.*?&type=(.*?)&.*?/

http://regex101.com/r/nU5yP2
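
For readers who want to try the pattern outside regex101, here is a minimal Python sketch. It does not involve Google Tag Manager; the function name `extract_subdomain_and_type` is made up for illustration, and the surrounding `/` delimiters are dropped because Python's `re` module does not use them.

```python
import re

# Pattern from the answer above, without the /.../ delimiters.
# Group 1: the 1-4 digit subdomain; group 2: the value of "type".
PATTERN = re.compile(r"https?://([\d]{1,4})\.website\.com/.*?&type=(.*?)&.*?")

def extract_subdomain_and_type(url):
    """Return (subdomain, type) or None if the URL does not match."""
    m = PATTERN.search(url)
    return (m.group(1), m.group(2)) if m else None

sample = ("http://123.website.com/?&guid=blahblahblah&page=something"
          "&type=abc&adv=abc1234&site={siteID}")
print(extract_subdomain_and_type(sample))   # ('123', 'abc')
print(extract_subdomain_and_type("http://abc.website.com/?&type=abc&x=1"))  # None
```

Note that the pattern expects an `&` after the type value, so a URL that ends with `type=abc` would not match.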

edited Apr 18 '14 at 20:05

answered Apr 18 '14 at 19:34

Pedro Lobito


Cool. Looks like this will catch type=abcdefg too. Forgot to mention that type= is sometimes longer than 3 characters. – MixedBeans Apr 18 '14 at 19:54
You want to get the `subdomain` and the `type`, right ? – Pedro Lobito Apr 18 '14 at 19:56
Any situations where this fails? The number portion is not likely to exceed 4 digits and will only be numbers. That's important. I don't want matches for letters there. The type= value will vary in length and may contain a mix of letters and numbers. – MixedBeans Apr 18 '14 at 20:03
Great! Did some testing and I found out I need to be able to hard code in the value of "type=" as in "type=abc" instead of just finding the value. Make sense? It needs to be an exact match, whereas the numbers do not. They just need to be present at the beginning. – MixedBeans Apr 18 '14 at 22:36
Man! Found a place where this fails. Not because your code doesn't work but because I need to make an additional match with "adv=abc1234" where abc1234 is a specific value. So far I changed your code, which seems to work, but I need to extend it to include "adv=". Here is what I have. Quite possible this is poor form. https?:\/\/([\d]{1,4})\.website\.com\/.*?&type=(abc) – MixedBeans Apr 19 '14 at 02:16

Mofi · Accepted Answer · 2014-04-21T15:22:28.380

-1

After your comments on the first version of this answer, I read a little about rules on the page Tags, Rules, Macros, and the Data Layer of Google Tag Manager.

Apparently you want a rule which returns true if the URL

starts with http:// or https://,
continues with a number of 1 to 3 digits,
followed by .website.com/,
and also contains type=abc, followed later by adv=.

I can't test this, but the following rule should work:

{{url}} matches RegEx https?://\d{1,3}\.website\.com/.*type=abc.*adv=.*

The regular expression engine of Google Tag Manager hopefully supports these basic constructs from the Perl regular expression syntax.

Explanation:

http is a fixed string which must exist at the beginning of the URL.

As Sam wrote, the question mark after s makes the existence of s optional.

:// is again a fixed string which must exist in the URL after http or https.

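\d{1,3} matches a digit (0-9) at least once, but not more than 3 times, so it matches numbers from 0 to 999. Any other character, or a fourth digit, between :// and .website.com results in false for the rule.

\.website\.com/ is again a fixed string, in which the escaped point \. is interpreted as a literal character.

.* matches any character of the URL 0 or more times. It is used three times: before type=abc, before adv=, and at the end. type=abc and adv= are again fixed strings which must occur in this order.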
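
Since the rule could not be tested in Google Tag Manager, here is a small Python sketch that exercises the same expression. Using `re.search` (unanchored matching) in place of "matches RegEx" is an assumption, and the function name `rule_fires` and the test URLs other than the one from the question are made up for illustration.

```python
import re

# The rule's expression; re.search is an assumption standing in for
# Google Tag Manager's "matches RegEx" (unanchored matching assumed).
RULE = re.compile(r"https?://\d{1,3}\.website\.com/.*type=abc.*adv=.*")

def rule_fires(url):
    """Return True if the URL satisfies the expression."""
    return RULE.search(url) is not None

urls = [
    "http://123.website.com/?&guid=x&page=y&type=abc&adv=abc1234&site=1",
    "https://7.website.com/?type=abc&adv=abc1234",
    "http://1234.website.com/?type=abc&adv=abc1234",  # four digits
    "http://abc.website.com/?type=abc&adv=abc1234",   # letters instead of digits
    "http://12.website.com/?adv=abc1234&type=abc",    # adv= before type=abc
]
for url in urls:
    print(rule_fires(url), url)
```

The first two URLs match and the last three do not. One caveat: `type=abc` is not an exact-value test here, so a URL containing `type=abcdefg` followed by `adv=` matches as well.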

edited Apr 21 '14 at 15:22

answered Apr 18 '14 at 18:26

Mofi


Ok. I want to be able to check for the following: http://123.website.com/ where "123" is a random number that can be anything from a single digit to 3 digits, and an exact match of type=abc – MixedBeans Apr 18 '14 at 18:39
Maybe it is two separate rules. If that is the case then the answer is easy I think. (https?:\/\/)(([0-9])|([0-9][0-9])|([0-9][0-9][0-9]))(\.tynon\.com) and another rule for type=abc – MixedBeans Apr 18 '14 at 18:45
Sorry I missed your answer UltraEdit. I have no idea if it supports perl expressions but your answer seems to be what I am looking to do. My goal now is to expand it to include looking for a match of a third group "adv=" – MixedBeans Apr 19 '14 at 17:49
If parameter `adv=` is always after parameter `type=abc` in the URLs, simply append `adv=` to the end of the regular expression after `.*`. Alternatively, insert the string `&adv=` before the `.*` at the end of the expression if the string `type=abc&adv=` is always present in the URLs in this form. – Mofi Apr 20 '14 at 15:22
For some reason Google kicks this out {{url}} matches RegEx https?:\/\/([1-9]\d{0,3})\.website\.com\/.*type=abc&adv=abc1234 . This rule doesn't seem to work. Think I need to escape the ampersand? – MixedBeans Apr 21 '14 at 14:29
The Wikipedia article about [Regular expression](http://en.wikipedia.org/wiki/Regular_expression) is also referenced from documentation about **Google Tag Manager**. The ampersand as well as the forward slash are characters with no special regular expression meaning and therefore do not need to be escaped. On the other hand, escaping `&` and `/` also doesn't matter. But if **Google Tag Manager** converts `&` to `&amp;`, it is better to use simply `.` (any character) instead of `&` in the expression or even better `.*` to match `&` and `&amp;` and everything else between `type=abc` and `adv=`. – Mofi Apr 21 '14 at 15:16
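
The points in the last comment can be checked with a short Python sketch. Whether Google Tag Manager really rewrites `&` as `&amp;` is not confirmed here; the `encoded` URL below is a hypothetical input built only to illustrate that case, and Python's `re.search` stands in for the rule engine.

```python
import re

# Literal "&" between the parameters, as in the rule from the comment above.
LITERAL_AMP = re.compile(r"https?://[1-9]\d{0,3}\.website\.com/.*type=abc&adv=abc1234")
# Same expression with "/" and "&" escaped; the escapes change nothing.
ESCAPED = re.compile(r"https?:\/\/[1-9]\d{0,3}\.website\.com\/.*type=abc\&adv=abc1234")
# ".*" between the parameters, as suggested in the reply.
ANY_BETWEEN = re.compile(r"https?://[1-9]\d{0,3}\.website\.com/.*type=abc.*adv=abc1234")

plain = "http://123.website.com/?&guid=x&type=abc&adv=abc1234&site=1"
encoded = plain.replace("&", "&amp;")  # hypothetical HTML-encoded form

for name, rx in [("literal &", LITERAL_AMP), ("escaped", ESCAPED), (".*", ANY_BETWEEN)]:
    print(name, bool(rx.search(plain)), bool(rx.search(encoded)))
```

The literal and escaped variants behave identically and only match the plain URL, while the `.*` variant matches both forms.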
