
In Perl, continuous parsing of a string can be done like this:

my $string = " a 1 # ";

while () {
    # /c keeps the current position when a match fails,
    # so the next branch resumes from the same place
    if ( $string =~ /\G\s+/gc )    {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Source: When is \G useful application in a regex?

It produces the following output:

whitespace
word
whitespace
integer
whitespace
done

In JavaScript (and many other regular expression flavors) there is no \G assertion, nor any good replacement.

So I came up with a very simple solution that serves my purpose.

<!-- language: lang-js --> 
//*************************************************
// pattmatch - matches the pattern PAT in ST starting at POS
// note the use of "^" to simulate the "\G" assertion
//*************************************************
function pattmatch(st, pat, pos)
{
    var resu;
    pat.lastIndex = 0;
    if (pos === 0)
        return pat.exec(st);               // match from the start of the string
    else {
        resu = pat.exec(st.slice(pos));    // match against a copy of the remainder
        if (resu)
            pat.lastIndex = pat.lastIndex + pos;   // convert back to an absolute position
        return resu;
    }
}

So, the above example would look like this in JavaScript (node.js):

<!-- language: lang-js -->
var string = " a 1 # ";
var pos=0, ret;  
var getLexema  = new RegExp("^(\\s+)|([0-9]+)|(\\w+)","gim");  
while (pos<string.length && ( ret = pattmatch(string,getLexema,pos)) ) {
    if (ret[1]) console.log("whitespace");
    if (ret[2]) console.log("integer");
    if (ret[3]) console.log("word");
    pos = getLexema.lastIndex;
}  // While
console.log("done");

It produces the same output as the Perl snippet:

whitespace
word
whitespace
integer
whitespace
done

Notice that the parser stops at the # character. Parsing can then continue from position pos in another piece of code.

Is there a better way in JavaScript to simulate Perl's \G regex assertion?

Edit

Out of curiosity, I decided to compare my own solution with @georg's proposal. I am not claiming which code is best; for me, it's a matter of taste.

Will my system, which depends heavily on user interaction, become slow?

@ikegami writes about @georg's solution:

... his solution adds is a reduction in the number of times your input file is copied ...

So I decided to compare both solutions in a loop that repeats the code 10 million times:

<!-- language: lang-js -->
var i;
var n1,n2;
var string,pos,m,conta,re;

// My code
conta=0;
n1 = Date.now();
for (i=0;i<10000000;i++) {
  string = " a 1 # ";
  pos=0;
  re  = new RegExp("^(\\s+)|([0-9]+)|(\\w+)","gim");
  while (pos<string.length && ( m = pattMatch(string,re,pos)) ) {
    if (m[1]) conta++;
    if (m[2]) conta++;
    if (m[3]) conta++;
    pos = re.lastIndex;
  }  // while
}
n2 = Date.now();
console.log('Mine: ' , ((n2-n1)/1000).toFixed(2), ' seconds' );


// Other code
conta=0;
n1 = Date.now();

for (i=0;i<10000000;i++) {
  string = " a 1 # ";
  re  = /^(?:(\s+)|([0-9]+)|(\w+))/i;
  while (m = string.match(re)) {
    if (m[1]) conta++;
    if (m[2]) conta++;
    if (m[3]) conta++;
    string = string.slice(m[0].length);
  }
}
n2 = Date.now();
console.log('Other: ' , ((n2-n1)/1000).toFixed(2) , ' seconds');

//*************************************************
// pattMatch - matches the pattern PAT in ST starting at POS
// note the use of "^" to simulate the "\G" assertion
//*************************************************
function pattMatch(st,pat,pos)
{
  var resu;
  pat.lastIndex=0;
  if (pos===0)
    return pat.exec(st);
  else {
    resu = pat.exec(st.slice(pos));
    if (resu)
      pat.lastIndex = pat.lastIndex + pos;
    return resu;
  }
} // pattMatch

Results:

Mine: 11.90 seconds
Other: 10.77 seconds

My code runs about 10% longer. It spends about 110 nanoseconds more per iteration.
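
The two figures above follow directly from the measured times. A small sketch of the arithmetic, using only the numbers reported in the results (the variable names are made up for this illustration):

```javascript
// Timings reported above, in seconds, for 10 million repetitions.
var mine = 11.90;
var other = 10.77;
var iterations = 10000000;

var extraSeconds = mine - other;                            // total extra time
var percentSlower = extraSeconds / other * 100;             // relative overhead
var extraNsPerIteration = extraSeconds / iterations * 1e9;  // seconds -> nanoseconds

console.log(percentSlower.toFixed(1) + '%');                // 10.5%
console.log(extraNsPerIteration.toFixed(0) + ' ns');        // 113 ns
```

Each repetition tokenizes the whole 7-character test string, so the per-iteration figure covers five tokens, not one.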

Honestly, I find this loss of efficiency acceptable in a system with heavy user interaction.

If my project involved heavy mathematical processing with multidimensional arrays or gigantic neural networks, I might rethink.

HoldOffHunger
Paulo Buchsbaum
2 Answers

4

The functionality of \G exists in the form of the /y (sticky) flag.

var regex = /^foo/y;
regex.lastIndex = 2;
regex.test('..foo');   // false - index 2 is not the beginning of the string

var regex2 = /^foo/my;
regex2.lastIndex = 2;
regex2.test('..foo');  // false - index 2 is not the beginning of the string or line
regex2.lastIndex = 2;
regex2.test('.\nfoo'); // true - index 2 is the beginning of a line

But it's quite new. You won't be able to use it on public web sites yet. Check the browser compatibility chart in the linked documentation.
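
Where /y is available, the Perl loop from the question could be sketched roughly as follows. This is only an illustration: the function name `tokenize` and the `rules` table are hypothetical, and it assumes an engine that supports sticky regexes.

```javascript
// Sketch: tokenize from a given position using sticky (/y) regexes.
// With /y, exec() only matches at exactly re.lastIndex, much like \G in Perl.
function tokenize(input, start) {
    var rules = [
        { name: 'whitespace', re: /\s+/y },
        { name: 'integer',    re: /[0-9]+/y },
        { name: 'word',       re: /\w+/y }
    ];
    var pos = start || 0;
    var tokens = [];
    var matched = true;
    while (matched && pos < input.length) {
        matched = false;
        for (var i = 0; i < rules.length && !matched; i++) {
            var re = rules[i].re;
            re.lastIndex = pos;          // the match must begin exactly here
            if (re.exec(input)) {
                tokens.push(rules[i].name);
                pos = re.lastIndex;      // now points just past the match
                matched = true;
            }
        }
    }
    // pos is the absolute index of the first character no rule accepted
    return { tokens: tokens, pos: pos };
}

var result = tokenize(" a 1 # ", 0);
console.log(result.tokens.join(','));    // whitespace,word,whitespace,integer,whitespace
console.log(result.pos);                 // 5, the index of "#"
```

The input string is never sliced, so every position stays absolute, and parsing can resume from `result.pos` with a different set of rules.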

ikegami
  • I didn't know that, @ikegami, thank you for sharing your knowledge! Unfortunately I cannot use this feature in my project, which requires some compatibility with older versions. So the question is: is there a more optimized way (apart from mine) to simulate \G without this new feature? – Paulo Buchsbaum Aug 01 '17 at 15:35
  • Nowadays I do not think that saving a function call within a loop is so relevant. For the kind of processing my application does, I'd rather pay that price than cut the string successively as processing goes on. This is because, in my actual process, I want to save some positions inside the string for further processing, and relative positions reduce clarity, even though I could recover the absolute positions. So the /y flag you cited, like \G, has its value. – Paulo Buchsbaum Aug 01 '17 at 17:46
  • I've edited my JavaScript code so that it has the same number of lines as @georg's JavaScript code. The difference is one additional function call and my choice to keep the index and the pattern match target intact. – Paulo Buchsbaum Aug 01 '17 at 17:53
  • Re "*Nowadays I do not think that saving a function call within a loop is so relevant.*", You really missed the point of their solution if you think it has to do with function calls?!? What his solution adds is a reduction in the number of times your input file is copied. – ikegami Aug 01 '17 at 17:56
  • Sorry for my ignorance, but which input file do you mean? The source code? Wouldn't the string, the regex target, already be in memory? The routine that performs pattern recognition (_exec_) is called the same number of times in both cases, in one case directly, in the other inside the function _pattmatch_. My edited solution has the same number of lines as the other. The loop executes the same number of times. It must be something obvious that is escaping me, but I cannot understand what you're talking about. – Paulo Buchsbaum Aug 01 '17 at 18:20
  • The input to a tokenizer is the string to tokenize. You keep making partial copies of it using slice. That makes your JavaScript-based tokenizer O(N^2). In contrast, the Perl-based tokeniser is O(N). – ikegami Aug 01 '17 at 18:58
2

Looks like you're overcomplicating it a bit. exec with the g flag provides anchoring out of the box:

var 
    string = " a 1 # ",
    re  = /(\s+)|([0-9]+)|(\w+)|([\s\S])/gi,
    m;

while (m = re.exec(string)) {
    if (m[1]) console.log('space');
    if (m[2]) console.log('int');
    if (m[3]) console.log('word');
    if (m[4]) console.log('unknown');    
}

If your regexp is not covering, and you want to stop on the first non-match, the simplest way would be to match from the ^ and strip the string once matched:

    var 
        string = " a 1 # ",
        re  = /^(?:(\s+)|([0-9]+)|(\w+))/i,
        m;

    while (m = string.match(re)) {
        if (m[1]) console.log('space');
        if (m[2]) console.log('int');
        if (m[3]) console.log('word');
        string = string.slice(m[0].length)
    }

    console.log('done, rest=[%s]', string)

This simple method doesn't fully replace \G (or your "match from" method), because it loses the left context of the match.
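
One way to keep the left context without the /y flag is to leave the string untouched, start a g-flagged regex at the wanted position through lastIndex, and accept the match only if it begins exactly there. A minimal sketch, assuming a single combined regex; the helper name `matchAt` is made up for this example:

```javascript
// Hypothetical helper: returns a match only if it starts exactly at pos.
// The regex must have the g flag, otherwise exec() ignores lastIndex.
function matchAt(re, input, pos) {
    re.lastIndex = pos;                  // begin searching at pos
    var m = re.exec(input);
    if (m && m.index === pos) return m;  // anchored at pos: accept
    re.lastIndex = pos;                  // found later or not at all: reject
    return null;
}

var input = " a 1 # ";
var lexeme = /(\s+)|([0-9]+)|(\w+)/g;
var pos = 0, m, seen = [];

while (pos < input.length && (m = matchAt(lexeme, input, pos))) {
    if (m[1]) seen.push('whitespace');
    if (m[2]) seen.push('integer');
    if (m[3]) seen.push('word');
    pos = lexeme.lastIndex;              // absolute position, no slicing
}

console.log(seen.join(','));             // whitespace,word,whitespace,integer,whitespace
console.log('stopped at', pos);          // stopped at 5
```

Nothing is copied, and `m.index` and `pos` are absolute positions in the original string. The trade-off is that a failing attempt may scan ahead to the end of the input before it is rejected, so this fits best when a single failure ends the loop, as it does here.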

georg
  • I may have missed something, but I do not think it does what I want. It skips the **"#"** character and scans the space character after it. I don't want that, because I want to be able to scan # in another snippet of code. In addition, this solution ignores embedded pieces that don't fit, which violates the concept of continuous parsing in a compiler or transpiler. For instance, if the string is **"& a & 1 # "**, the result will be **space word space space int space space**. That would violate the "syntax" I have created above. Otherwise, **\G** would be useless in Perl. – Paulo Buchsbaum Aug 01 '17 at 14:55
  • @PauloBuchsbaum: sure, a real world regex should cover all possible cases, with the last (fallback) group `([\s\S])`. See update – georg Aug 01 '17 at 15:02
  • It's hard to explain. The above regex is just for illustration. My real app is much more complicated than this, but I will try to clarify: my real goal is to stop the pointer at the first part of the string that does not fit this pattern, so that in the following code snippet I can deal with it using another regex. In this case (string **"a 1 # "**), I need to stop at the "#" character, not skip the unwanted parts and proceed with the same regex. After all, Perl must have added **\G** to its regex spec for some reason. – Paulo Buchsbaum Aug 01 '17 at 15:27
  • @PauloBuchsbaum: I see where you're coming from, updated the answer. – georg Aug 01 '17 at 15:43
  • Yep, nice, thank you, @georg. It works flawlessly; however, I still prefer my approach because the regex is a little bit simpler and the string doesn't need to be changed in each iteration. The change is encapsulated and local to the routine pattmatch. I will edit my code to make it simpler. – Paulo Buchsbaum Aug 01 '17 at 15:54
  • Re "*I still prefer my approach*", Your approach does far more copying. – ikegami Aug 01 '17 at 15:57
  • I have edited my code. Both are just as simple, it's just a matter of taste. – Paulo Buchsbaum Aug 01 '17 at 16:09
  • Except for the function, both solutions have the same number of lines of code. One chose to move the pointer, the other chose to successively change the pattern recognition target. In fact, your solution is a bit faster because it avoids an additional function call in each iteration. – Paulo Buchsbaum Aug 01 '17 at 16:46
  • Re "*Both are just as simple, it's just a matter of taste*", The whole point was to get the efficiency of `/\G.../gc`, and that's not a matter of taste. – ikegami Aug 01 '17 at 17:59
  • Perl can be better than JavaScript for some uses. I have no experience in that language, but I'm using Ionic, which uses Angular, which uses JavaScript. So I have no choice for now! And as I've shown in my edit, the time difference for each execution cycle is on the order of 10%, which corresponds, in my environment, to about 100 nanoseconds. – Paulo Buchsbaum Aug 01 '17 at 20:36