/(.*?)((http:\/\/|https:\/\/)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/[a-zA-Z0-9\-\.]+)*){1}(.*?)/g

I could only make a few assumptions about the regex above; most of it is cryptic to me.

(http:\/\/|https:\/\/) - Matches either the http:// or the https:// protocol prefix.

[a-zA-Z]{2,6} - Matches between 2 and 6 lowercase or uppercase letters.

/g - Global flag: search for every match, not just the first one.

But I was not able to put all of the blocks together.

Sushanth

1 Answer

This looks like it's trying to match full URLs.

  • (http:\/\/|https:\/\/)?, as you mentioned, looks for an optional protocol prefix
  • (.*?) at the beginning and end match anything that may be before or after the URLs.
  • [a-zA-Z0-9\-\.]+ is likely attempting to match domain names and sub-domains (e.g. test.us.domain)
  • \.[a-zA-Z]{2,6} matches top-level domains (e.g. .com, .us, .ninja)
  • (\/[a-zA-Z0-9\-\.]+)* is looking for paths (e.g. /about, /files/my-file001.txt)
  • {1} requires the preceding group to match exactly once (this is the default, so the quantifier is redundant)
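To see how the pieces listed above fit together, here is a minimal sketch that runs the regex from the question over a made-up sample string and labels each capture group. The sample text and the property names (before, url, protocol, lastPathSegment, after) are my own assumptions for illustration, not part of the original post.

```javascript
// The regex from the question, unchanged.
const urlPattern = /(.*?)((http:\/\/|https:\/\/)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/[a-zA-Z0-9\-\.]+)*){1}(.*?)/g;

// Hypothetical sample input, not taken from the original post.
const sampleText =
  "Docs at https://example.com/files/my-file001.txt and also test.us.domain here";

// Because of the g flag, matchAll yields every match, not only the first.
const matches = [...sampleText.matchAll(urlPattern)].map((m) => ({
  before: m[1],          // leading (.*?): text skipped before the URL
  url: m[2],             // the whole URL-like part
  protocol: m[3],        // "http://" or "https://", undefined when absent
  lastPathSegment: m[4], // a repeated group keeps only its last repetition
  after: m[5],           // trailing (.*?): lazy with nothing after it, so always ""
}));

console.log(matches.map((m) => m.url));
// → [ 'https://example.com/files/my-file001.txt', 'test.us.domain' ]
```

Note that the trailing (.*?) never captures anything here: it is lazy and nothing follows it in the pattern, so it always matches the empty string.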

This regex has its faults for this purpose. For example, the segments that allow . characters (e.g. [a-zA-Z0-9\-\.]+) accept several of them in a row (e.g. a...c...d). Generally speaking, though, it should match URLs, provided the data around them doesn't look too much like URLs.
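A quick way to check these weaknesses is to feed the regex inputs that are not really URLs. The sketch below uses a hypothetical helper, extractUrlLike, and made-up inputs; neither comes from the original post.

```javascript
// The regex from the question, unchanged.
const urlPattern = /(.*?)((http:\/\/|https:\/\/)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/[a-zA-Z0-9\-\.]+)*){1}(.*?)/g;

// Hypothetical helper: returns only the URL-like part (group 2) of each match.
function extractUrlLike(text) {
  return [...text.matchAll(urlPattern)].map((m) => m[2]);
}

// Runs of consecutive dots are accepted, although no real domain looks like this.
const repeatedDots = extractUrlLike("a...b...com");
console.log(repeatedDots); // → [ 'a...b...com' ]

// Ordinary text that merely resembles a domain, such as a file name, also matches.
const fileName = extractUrlLike("open README.md first");
console.log(fileName); // → [ 'README.md' ]
```

If false positives like these matter, one option is to require the protocol prefix instead of leaving it optional, at the cost of missing bare domains.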

thesquaregroot