I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.

I need to catch every argument passed to two specific function calls, _n( and _(. For example, if I have these in my JavaScript files:

_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls.. 

This refers to this documentation: http://poedit.net/trac/wiki/Doc/Keywords

I'm planning to do it in two passes (with two regexes):

  1. catch all function arguments for _n( or _( method calls
  2. catch the stringy ones only

Basically, I'd like a regex that says "catch everything after _n( or _( and stop at the closing parenthesis ) where the function call actually ends". I don't know whether that is possible with a regex and without a JavaScript parser.

Alternatively, it could "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the start of the next _n( or _(".

With everything I've tried, I get stuck either on _( "one (optional)" ); with its inner parentheses, or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.

Here is what I have implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one

whoan
guillaumepotier
  • You say you're doing a JS parser, but your attempted regex is PCRE (it's incompatible with JS since it uses a lookbehind). Which regex flavor do you intend to use? – Lucas Trzesniewski Nov 19 '14 at 17:46
  • Hi Lucas, it is a PHP parser for javascript files. So yes the regex is PCRE. Best – guillaumepotier Nov 22 '14 at 15:57
  • OK, but either way, you should use a JS parser because `you("will", encounter("unexpected", "code") || "patterns" /* or */ + "comments")` in real code. Handling this with regexes will be an unnecessary pain. – Lucas Trzesniewski Nov 23 '14 at 11:36
  • Does the pattern take in account functions inside strings of an `eval(..)` statement or inside comments? – Casimir et Hippolyte Nov 26 '14 at 13:21

6 Answers

8

Note: Read this answer if you're not familiar with recursion.

Part 1: match specific functions

Who said that regex can't be modular? Well PCRE regex to the rescue!

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx

The s modifier lets . match newlines, and the x modifier allows the free spacing and comments used in the regex above.
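To see the (?(DEFINE)...) and (?&name) mechanics in isolation, here is a cut-down sketch of the same idea. It is not the full pattern above: the group name str is made up for this example and it only knows about double-quoted strings, so treat it as an illustration rather than a replacement.

```php
<?php
// Minimal sketch: a (?(DEFINE)...) block declares named sub-patterns that match
// nothing by themselves; (?&name) then calls them like subroutines.
// Assumption: only double-quoted strings are handled here (the "str" group is hypothetical).
$pattern = <<<'regex'
~
(?(DEFINE)
   (?P<str> "(?:[^"\\]|\\.)*" )      # A double-quoted string, escapes allowed
   (?P<brackets> \( (?: (?&str) | [^()"] | (?&brackets) )* \) )
)
_n?\s*(?P<results>(?&brackets))      # _ or _n followed by balanced brackets
~sx
regex;

$input = 'apples === 0 ? _( "No (apples)" ) : _n("%1 apple", "%1 apples", apples)';

preg_match_all($pattern, $input, $m);
print_r($m['results']);
// Expected: two entries,
//   ( "No (apples)" )
//   ("%1 apple", "%1 apples", apples)
```

The brackets inside the string do not end the match early because the str alternative consumes the whole string before [^()"] gets a chance to look at it.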

Online regex demo Online php demo

Part 2: getting rid of opening & closing brackets

Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:

~           # Delimiter
^           # Assert begin of string
\(          # Match an opening bracket
\s*         # Match optional whitespaces
|           # Or
\s*         # Match optional whitespaces
\)          # Match a closing bracket
$           # Assert end of string
~x

Online php demo
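As a quick check of this step on its own, the sketch below runs the same regex (written on one line, without the x modifier) over a few hand-written sample values. The $results array is a hypothetical stand-in for the results group captured in Part 1.

```php
<?php
// Hypothetical sample values standing in for the 'results' group from Part 1.
$results = array('( "one (optional)" )', '("bar", "baz", 42)', '()');

// Strip one leading "(" plus the whitespace after it,
// and one trailing ")" plus the whitespace before it.
$filtered = preg_replace('~^\(\s*|\s*\)$~', '', $results);

print_r($filtered);
// Expected: "one (optional)"   then   "bar", "baz", 42   then an empty string
```

Because both alternatives are anchored, the brackets inside "one (optional)" are left alone.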

Part 3: extracting the arguments

So here's another modular regex, you could even add your own grammar:

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~xis

We will loop and use preg_match_all(). The final code would look like this:

$functionPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx
regex;


$argumentsPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;

$input = <<<'input'
_  ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..

// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")",   '(')   );
_n(function(foo){return foo*2;}); // Is this even valid?
_n   ();   // Empty
_ (   
    "Foo",
    'Bar',
    Array(
        "wow",
        "much",
        'whitespaces'
    ),
    multiline
); // PCRE is awesome
input;

if(preg_match_all($functionPattern, $input, $m)){
    $filtered = preg_replace(
        '~          # Delimiter
        ^           # Assert begin of string
        \(          # Match an opening bracket
        \s*         # Match optional whitespaces
        |           # Or
        \s*         # Match optional whitespaces
        \)          # Match a closing bracket
        $           # Assert end of string
        ~x', // Regex
        '', // Replace with nothing
        $m['results'] // Subject
    ); // Getting rid of opening & closing brackets

    // Part 3: extract arguments:
    $parsedTree = array();
    foreach($filtered as $arguments){   // Loop
        if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => $m[0]
            ); // Add an array to our tree and fill it
        }else{
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => array()
            ); // Add an array with empty branches
        }
    }

    print_r($parsedTree); // Let's see the results;
}else{
    echo 'no matches';
}

Online php demo

You might want to create a recursive function to generate a full tree. See this answer.

You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
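The patterns above will also match a call that sits inside a // comment. One possible way around that, sketched here with the PCRE backtracking verbs (*SKIP) and (*FAIL), is to consume the comment first and then force the match to fail so the engine resumes after it. The bracket matching is deliberately simplified (no string handling), and a // inside a string literal such as a URL would be wrongly treated as a comment, so this is only a starting point.

```php
<?php
// Sketch only: simplified bracket matching, and "//" inside a string literal
// (e.g. a URL) would wrongly be treated as a comment.
$pattern = <<<'regex'
~
//[^\n]*(*SKIP)(*FAIL)        # Consume a line comment, then fail and skip past it
|
_n?\s*(?P<results>\((?:[^()]|(?&results))*\))   # Otherwise match the call
~x
regex;

$input = <<<'input'
// _n("commented out")
_n("bar", "baz", 42); // _("also ignored")
_("foo")
input;

preg_match_all($pattern, $input, $m);
print_r($m['results']);
// Expected: ("bar", "baz", 42)   and   ("foo")
```

Block comments could be skipped the same way by adding another alternative in front, at the cost of a longer pattern.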

Community
HamZa
  • Looks most convoluted among other solutions, so I'll assume that it's most accurate. But does it take into account the fact that a line like `// _n(x)` is not actually a call to a function (which is stated by OP as a predicate)? – Grx70 Nov 27 '14 at 14:48
  • @Grx70 Nope, but I could account for situations like that by using more advanced tools like `(*SKIP)` & `(*FAIL)`. [See demo](http://regex101.com/r/zV8jG9/2). I could also write something for `/* */` but I'm too busy at the moment for that :) – HamZa Nov 27 '14 at 15:17
1

Try this:

(?<=\().*?(?=\s*\)[^)]*$)

See live demo

Bohemian
  • Hi @Bohemian, I first thought you had saved my life. Unfortunately, I found that in my templates it is quite common to have two translations on the same line with ternaries. I updated the post above to show you. Thanks for your help! – guillaumepotier Nov 25 '14 at 09:21
  • Here, without $, it works almost perfectly for all of them except the last translation (http://regex101.com/r/uD9uK1/17). This one, ((_|__|_t|_n|gettext|ngettext|dgettext))\((.*?)(?=\s*\)[^)]*$), catches the last one almost correctly, but only the last one. – guillaumepotier Nov 25 '14 at 10:12
0

Below regex should help you.

^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);

Check the demo here

Kannan Mohan
  • Thanks @Kannan, it's almost good, but I found a weakness in your regex: if a string contains a closing parenthesis, it won't work. Please have a look at this version: http://regex101.com/r/uD9uK1/11 Thanks – guillaumepotier Nov 24 '14 at 11:12
  • I have updated the regex which can also match closing parenthesis. In case if you need the regex to match more special character it can be added inside this group `([\s'!\\\)",\w]+)`. – Kannan Mohan Nov 24 '14 at 11:24
0

\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)

This should get anything between a pair of parentheses, ignoring parentheses in quotes. Explanation:

\( // Literal open paren
    (
         | //Space or
        "(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
        '(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
        [^)"'] //Any character that isn't a quote or close paren
    )*? // All that, as many times as necessary
\) // Literal close paren
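Since the question is about a PHP parser, here is a rough sketch of how that regex might be applied with preg_match_all(). The ~ delimiters and the sample input are assumptions added for this example; the pattern itself is unchanged.

```php
<?php
// The regex from above wrapped in ~ delimiters (an assumption for this example);
// a nowdoc avoids a second layer of PHP-level escaping.
$pattern = <<<'regex'
~\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)~
regex;

// Hypothetical input with parentheses and an escaped quote inside strings.
$input = <<<'input'
_( "one (optional)" ); _n("a \" quote)", 'b)', c)
input;

preg_match_all($pattern, $input, $m);
print_r($m[0]);
// Expected: ( "one (optional)" )   and   ("a \" quote)", 'b)', c)
```

Note that the matches still include the surrounding parentheses, and that nothing ties a match to _ or _n, so any parenthesised group in the input would be returned as well.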

No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?

// This is just pseudocode.  A loop like this can be more readable, maintainable, and predictable than a regular expression.
for(int i = 0; i < input.length; i++) {
    // Ignoring anything that isn't an opening paren
    if(input[i] == '(') {
        String capturedText = "";
        i++; // Skip the opening paren itself
        // Loop until a close paren is reached, or an EOF is reached (bounds are checked before reading)
        for(; i < input.length && input[i] != ')'; i++) {
            if(input[i] == '"' || input[i] == "'") {
                char quote = input[i];
                capturedText += input[i++]; // Keep the opening quote and step inside the string
                // Loop until an unescaped close quote is reached, or an EOF is reached
                for(; i < input.length && input[i] != quote; i++) {
                    if(input[i] == '\\' && i + 1 < input.length) {
                        capturedText += input[i++]; // Keep the backslash; the escaped character is added below
                    }
                    capturedText += input[i];
                }
            }
            if(i < input.length) {
                capturedText += input[i]; // An ordinary character, or the closing quote of a string
            }
        }
        capture(capturedText);
    }
}

Note: I didn't cover how to determine whether a parenthesis belongs to a function call or is just a grouping symbol (i.e., this will match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own JavaScript parser. You might want to take a look at the source code of actual JavaScript parsers if you need that sort of accuracy.

Jack
0

One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):

$string = '_("foo")
_n("bar", "baz", 42); 
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';

preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);

foreach($matches[0] as $test){
    $opArr = explode(',', $test);
    foreach($opArr as $test2){
        echo trim($test2) . "\n";
    }
}

You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1

Output is:

"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
SierraOscar
-1

We can do this in two steps:

1) Catch all function arguments for _n( or _( method calls

(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)

See demo.

http://regex101.com/r/oE6jJ1/13

2) Catch the stringy ones only

"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))

See demo.

http://regex101.com/r/oE6jJ1/14

vks