I'm using the following regex to match all types of URLs in PHP (it works very well):

 $reg_exUrl = "%\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";

Now I want to exclude YouTube, youtu.be and Vimeo URLs.

After some research I tried the following, but it does not work:

$reg_exUrl = "%\b(([\w-]+://?|www[.])(?!youtube|youtu|vimeo)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";

I want to do this because I have another regex that matches YouTube URLs and returns an iframe, and the general URL regex conflicts with it.

Any help would be greatly appreciated, thanks.

ecodL
  • Why don't you just filter out unwanted domains in a second step? Or, even better, a third step: the second step should be URL normalization. Then it's pretty easy and much more robust. – hakre Apr 21 '14 at 22:38
  • FYI The original answer was general, but I added a regex specifically for your situation. – zx81 Apr 21 '14 at 22:54
  • Thanks for your comment @hakre, but I don't know much about regular expressions. I have a comment system, and I want to detect all the URLs (to make them clickable with "href"), hashtags (search) and YouTube URLs (iframe). I have to do all of this at the same time, when I fetch the data from the database to show it to the user. – ecodL Apr 21 '14 at 22:56

1 Answer

socodLib, to exclude something from a string, place yourself at the beginning of the string by anchoring with a ^ (or use another anchor) and use a negative lookahead to assert that the string doesn't contain a word, like so:

^(?!.*?(?:youtube|some other bad word|some\.string\.with\.dots))

Before we make the regex look too complex by concatenating it with yours, let's look at a simpler case. If you wanted to match some word characters \w+ but reject any string containing youtube or google, you would write:

^(?!.*?(?:youtube|google))\w+

As you can see, after the assertion (where we say what we don't want), we say what we do want with \w+.
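
A minimal PHP sketch of this lookahead pattern, assuming PCRE through preg_match. The sample strings and the variable names $samples and $results are made up for illustration:

```php
<?php
// Anchored negative lookahead: the match fails if "youtube" or "google"
// appears anywhere in the string; otherwise \w+ matches from the start.
$pattern = '/^(?!.*?(?:youtube|google))\w+/';

// Hypothetical sample inputs (not from the original question).
$samples = ['example', 'vimeo_clip', 'my_youtube_clip', 'googleplex'];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 on a match and 0 when nothing matches.
    $results[$sample] = preg_match($pattern, $sample) === 1;
}

foreach ($results as $sample => $matched) {
    echo $sample, ' => ', $matched ? 'match' : 'rejected', PHP_EOL;
}
```

Here 'example' and 'vimeo_clip' match, while 'my_youtube_clip' and 'googleplex' are rejected because the lookahead finds a banned word before \w+ is ever tried.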

In your case, let's add a negative lookahead to your initial regex (which I have not tuned):

$reg_exUrl = "%(?i)\b(?!.*?(?:youtu\.?be|vimeo))(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";

I took the liberty of making the regex case-insensitive with (?i). You could also have added i next to the s modifier at the end. The youtu\.?be expression makes the dot optional, so it covers both youtube and youtu.be.
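
To see how this behaves on real input, here is a rough PHP sketch using preg_match_all. One thing to watch: because the lookahead begins with .*? and the s modifier lets the dot cross newlines, a YouTube or Vimeo link anywhere later in the text also suppresses the ordinary URLs before it. The second pattern, $scopedPattern, is my own variation (an assumption, not part of the regex above) that limits the lookahead to the characters of the current URL candidate. The helper findUrls and the sample strings are hypothetical.

```php
<?php
// Pattern from the answer: the lookahead scans the whole rest of the subject.
$answerPattern = '%(?i)\b(?!.*?(?:youtu\.?be|vimeo))(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s';

// Variation (an assumption): the lookahead only scans characters that can
// belong to the current URL candidate, so other URLs in the text are unaffected.
$scopedPattern = '%(?i)\b(?![^\s()<>]*(?:youtu\.?be|vimeo))(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s';

// Hypothetical helper: return every full match of $pattern in $text.
function findUrls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// One URL per string: both behave as intended.
$single = findUrls($answerPattern, 'https://www.youtube.com/watch?v=abc'); // []
$plain  = findUrls($answerPattern, 'http://example.com/page');             // ['http://example.com/page']

// Mixed text: the ordinary URL comes before a youtu.be link.
$mixed       = 'see http://example.com and https://youtu.be/abc';
$mixedAnswer = findUrls($answerPattern, $mixed); // []
$mixedScoped = findUrls($scopedPattern, $mixed); // ['http://example.com']
```

The scoped variant still rejects any URL that contains youtube, youtu.be or vimeo anywhere in it, including in a path or query string such as ?ref=youtube, so a stricter host-only check may be preferable if that matters.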

I am certain you can apply this recipe to your expression and other regexes in the future.

zx81