I'm trying to convert this Python regex to a JavaScript regex:

https://github.com/rg3/youtube-dl/blob/a14e1538fe66c49ca8869681d2bbe60a36bd420d/youtube_dl/extractor/youtube.py#L134-L159

r"""(?x)^
(
    (?:https?://|//)?                                    # http(s):// or protocol-independent URL (optional)
    (?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|
    (?:www\.)?deturl\.com/www\.youtube\.com/|
    (?:www\.)?pwnyoutube\.com/|
    (?:www\.)?yourepeat\.com/|
    tube\.majestyc\.net/|
    youtube\.googleapis\.com/)                        # the various hostnames, with wildcard subdomains
    (?:.*?\#/)?                                          # handle anchor (#/) redirect urls
    (?:                                                  # the various things that can precede the ID:
        (?:(?:v|embed|e)/)                               # v/ or embed/ or e/
        |(?:                                             # or the v= param in all its forms
            (?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?  # preceding watch(_popup|.php) or nothing (like /?v=xxxx)
            (?:\?|\#!?)                                  # the params delimiter ? or # or #!
            (?:.*?&)?                                    # any other preceding param (like /?s=tuff&v=xxxx)
            v=
        )
    ))
    |youtu\.be/                                          # just youtu.be/xxxx
    |https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=
    )
)?                                                       # all until now is optional -> you can pass the naked ID
([0-9A-Za-z_-]{11})                                      # here is it! the YouTube video ID
(?(1).+)?                                                # if we found the ID, everything can follow
$"""

I removed the quotes at the start and end, added the start delimiter /^ and the end delimiter /i, escaped the forward slashes, removed the free-spacing mode, and ended up with this:

var VALID_URL = /^((?:https?:\/\/|\/\/)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com\/|(?:www\.)?deturl\.com\/www\.youtube\.com\/|(?:www\.)?pwnyoutube\.com\/|(?:www\.)?yourepeat\.com\/|tube\.majestyc\.net\/|youtube\.googleapis\.com\/)(?:.*?\#\/)?(?:(?:(?:v|embed|e)\/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?\/?)?(?:\?|\#!?)(?:.*?&)?v=)))|youtu\.be\/|https?:\/\/(?:www\.)?cleanvideosearch\.com\/media\/action\/yt\/watch\?videoId=))?([0-9A-Za-z_-]{11})(?(1).+)?$/g;

However, the JavaScript regex debugger I'm using reports Unexpected character "(" after "?" for the JavaScript version of this part of the Python regex:

(?(1).+)?      # if we found the ID, everything can follow

Any idea how I can resolve this error?

zx81
user784637

1 Answer

JavaScript regular expressions do not support conditionals such as (?(1)...).
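As a quick check, and only as a sketch: the engine rejects the conditional when the pattern is compiled, so the regex never gets a chance to run. The toy patterns and the helper name `compiles` below are invented for illustration.

```javascript
// Returns true if `source` compiles as a JavaScript regular expression.
// `compiles` is a hypothetical helper name, not part of any library.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    // Invalid patterns throw a SyntaxError when the RegExp is constructed.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Toy pattern with a Python-style conditional:
// "(?(1)c)" means "match c only if group 1 matched".
const withConditional = "^(a)?b(?(1)c)$";
// The same toy pattern with the conditional part removed.
const withoutConditional = "^(a)?b$";

console.log(compiles(withConditional));    // false
console.log(compiles(withoutConditional)); // true
```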

But the world of regex has long survived without conditionals, and there are ways around it.

The Idea

The basic structure of that scary regex was this:

(Capture A)? (Match B) ( If A was captured, (Match C)? )

You can translate the IF into an OR:

(Capture A) (Match B) (Match C)? OR (Match B)
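To make the rewrite concrete, here is a toy sketch of the same transformation. The pattern is invented for illustration: A is the literal id=, B is three digits, and C is .+. The helper name `matchB` is hypothetical.

```javascript
// Python original, which JavaScript cannot compile:
//   ^(id=)?(\d{3})(?(1).+)?$
// JavaScript rewrite: "(A)(B)(C)? OR (B)", with ^ and $ repeated on each side.
const TOY = /^(id=)(\d{3})(?:.+)?$|^(\d{3})$/;

// Returns the B part (the three digits), or null when the input does not match.
function matchB(input) {
  const m = TOY.exec(input);
  if (m === null) return null;
  // The left side of the OR stores B in group 2; the right side stores it in group 3.
  return m[2] !== undefined ? m[2] : m[3];
}

console.log(matchB("id=123xyz")); // 123  (A is present, so C may follow)
console.log(matchB("123"));       // 123  (bare B)
console.log(matchB("123xyz"));    // null (no A, so nothing may follow B)
```

The price of the rewrite is that B is duplicated in the pattern and ends up in a different group number depending on which branch matched.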

Converted Regex

Try this:

^((?:https?://|//)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|(?:www\.)?deturl\.com/www\.youtube\.com/|(?:www\.)?pwnyoutube\.com/|(?:www\.)?yourepeat\.com/|tube\.majestyc\.net/|youtube\.googleapis\.com/)(?:[^\n]*?#/)?(?:(?:(?:v|embed|e)/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?(?:\?|#!?)(?:[^\n]*?&)?v=)))|youtu\.be/|https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=)([0-9A-Za-z_-]{11})(?:[^\n]+)?)$|^([0-9A-Za-z_-]{11})$

Explanation

The (?(1)[^\n]+)? conditional optionally matches [^\n]+, but only if Group 1 is set. Since it occurs after the non-optional ([0-9A-Za-z_-]{11}), I transformed the conditional into an alternation (|).

  • I make no judgment about the suitability of the regex... I rearranged the "grammar" without looking at the "words". :)
  • Either we match that whole Group 1, into which we now directly roll the ([0-9A-Za-z_-]{11}) and the optional component, OR
  • We directly match the ([0-9A-Za-z_-]{11})
  • If you are interested in retrieving the ([0-9A-Za-z_-]{11}), it will live inside a different capture Group depending on which side of the alternation matches it: Group 2 on the left side, Group 3 on the right side.
  • There are probably lots of parentheses you can remove, depending on your needs
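For use in JavaScript, the pattern could be wrapped along the lines of the following sketch. Building it with new RegExp and String.raw avoids having to escape every forward slash. The names `PREFIX`, `ID`, `VALID_URL`, and `extractVideoId` are hypothetical, and the trailing $ on each branch is an assumption carried over from the $ that ends the Python original.

```javascript
// Everything that may precede the ID (protocol, hostnames, path, v= parameter).
// String.raw keeps the backslashes intact, so \w, \. and \n reach the RegExp unchanged.
const PREFIX = String.raw`(?:https?://|//)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|(?:www\.)?deturl\.com/www\.youtube\.com/|(?:www\.)?pwnyoutube\.com/|(?:www\.)?yourepeat\.com/|tube\.majestyc\.net/|youtube\.googleapis\.com/)(?:[^\n]*?#/)?(?:(?:(?:v|embed|e)/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?(?:\?|#!?)(?:[^\n]*?&)?v=)))|youtu\.be/|https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=)`;

// The 11-character video ID, as a capturing group.
const ID = String.raw`([0-9A-Za-z_-]{11})`;

// Left branch: prefix + ID (group 2) + optional trailing text, all inside group 1.
// Right branch: the naked ID on its own (group 3).
// Assumption: "$" is added to both branches to mirror the Python pattern's final "$".
const VALID_URL = new RegExp(
  "^(" + PREFIX + ID + String.raw`(?:[^\n]+)?` + ")$|^" + ID + "$"
);

// Returns the video ID, or null when the input does not match.
function extractVideoId(input) {
  const m = VALID_URL.exec(input);
  if (m === null) return null;
  // Whichever branch matched determines which group holds the ID.
  return m[2] !== undefined ? m[2] : m[3];
}

console.log(extractVideoId("http://www.youtube.com/watch?v=BaW_jenozKc")); // BaW_jenozKc
console.log(extractVideoId("BaW_jenozKc"));                                // BaW_jenozKc
console.log(extractVideoId("not a video url"));                            // null
```

Reading the ID through a small helper like this keeps callers from having to know which side of the alternation matched.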

Reference

zx81

  • Terrific, glad it worked. FYI, I posted some references about conditionals. See you next time! :) – zx81 Jun 19 '14 at 06:01
  • Thanks again, this is a good website I've been using to visualize how a regex works: https://www.debuggex.com/. – user784637 Jun 19 '14 at 06:04
  • I was a little confused about something. I created a jsfiddle and noticed that one URL passes when it should not: http://jsfiddle.net/hd6E3/2/ We expect only IDs made of `[0-9A-Za-z_-]` to work, but this URL passes, `http://www.youtube.com/watch?v=iNu+bNAiaSA&feature=youtube_gdata_player`, even though it has a `+` symbol in the video ID. – user784637 Jun 19 '14 at 06:22
  • FYI, added an explanation about the logic of the translation. Will take a look at your fiddle... But I was interested in the translation, not analyzing the regex. :) – zx81 Jun 19 '14 at 06:28
  • By the way, have you seen this highly upvoted post in the FAQ about [matching YouTube IDs](http://stackoverflow.com/questions/5830387/how-to-find-all-youtube-video-ids-in-a-string-using-a-regex)? – zx81 Jun 19 '14 at 06:30
  • Just so you know, I'm looking at it. Not sure what's going on yet. :) – zx81 Jun 19 '14 at 07:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/55892/discussion-between-zx81-and-user784637). – zx81 Jun 19 '14 at 07:28
  • GOT IT!!! I hadn't noticed the `^` at the very top, before the huge parenthesis. It applies to the whole regex, so it needs to be distributed on each side of the OR. You know, as in 2*(3+4) = 2*3 + 2*4. Without it, the right side of the OR matched in an unanchored position. Added it in the right place on the answer. :) – zx81 Jun 19 '14 at 07:32
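
The anchoring point in that last comment can be seen in a toy sketch (the patterns are invented for illustration): alternation binds more loosely than `^`, so an anchor written once only constrains the first branch.

```javascript
// "^" applies only to the left branch here, so "cd" may match anywhere in the input.
const anchoredLeftOnly = /^ab|cd/;
// Repeating "^" on each branch anchors both alternatives.
const anchoredBoth = /^ab|^cd/;
// Equivalent alternative: wrap the alternation in a non-capturing group.
const anchoredGroup = /^(?:ab|cd)/;

console.log(anchoredLeftOnly.test("xxcd")); // true
console.log(anchoredBoth.test("xxcd"));     // false
console.log(anchoredGroup.test("xxcd"));    // false
console.log(anchoredBoth.test("cdxx"));     // true
```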