I keep seeing this regex in language grammars, which editors use to highlight syntax.

I know what the regex is trying to convey:

(?!\G) Negative Lookahead - Assert that it is impossible to match the regex below
\G assert position at the end of the previous match or the start of the string for the first match

Here's the snippet which caught my attention:

console

# console.log(arg1, "arg2", [...])
'begin': '\\bconsole\\b'
'beginCaptures':
  '0':
    'name': 'entity.name.type.object.console.js'
'end': '(?!\\G)'
'patterns': [
  {
    'begin': '\\s*(\\.)\\s*(assert|clear|debug|error|info|log|profile|profileEnd|time|timeEnd|warn)\\s*(?=\\()'
    'beginCaptures':
      '1':
        'name': 'meta.delimiter.method.period.js'
      '2':
        'name': 'support.function.console.js'
    'end': '(?<=\\))'
    'name': 'meta.method-call.js'
    'patterns': [
      {
        'include': '#arguments'
      }
    ]
  }
]

The above snippet is from the atom/language-javascript package.

From what I've understood by browsing various TextMate forums, the editor starts highlighting where the begin regex matches and continues until the end regex matches. Here it starts by matching the console keyword and then continues until the end regex matches, which is the part I don't understand: where would it stop?

Could somebody explain it?

Kartik Anand
  • I know about look-ahead. I'm asking this in context of `\G` and specifically how language grammars interpret this – Kartik Anand Apr 13 '16 at 06:44
  • `\G` - start of a string position or the end of the previous successful match. `(?!\G)` - a location that is not the same location that can be matched with `\G`. – Wiktor Stribiżew Apr 13 '16 at 06:47
  • @WiktorStribiżew My question is more concerned with how this will affect the end match for an editor for syntax highlighting – Kartik Anand Apr 13 '16 at 06:48
  • You tagged and built the question in such a way that it does sound like a dupe. Please edit it, tag with `atom`. The *What does the regex (?!\G) do?* title makes it a dupe of [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). – Wiktor Stribiżew Apr 13 '16 at 06:50
  • @WiktorStribiżew I've edited the title and tags for the question – Kartik Anand Apr 13 '16 at 06:56
  • BTW, what is the regex syntax used in there? I believe it is Oniguruma, but there is no confirmation in the docs. – Wiktor Stribiżew Apr 13 '16 at 10:43
  • The Text-Mate grammar uses Oniguruma only. Not sure about Atom though. https://manual.macromates.com/en/regular_expressions – Kartik Anand Apr 13 '16 at 10:58

1 Answer

First, some excerpts from the Language Grammars reference:

There are two ways a rule can match the document. It can either provide a single regular expression, or two. As with the match key in the first rule above (lines 6-8), everything which matches that regular expression will then get the name specified by that rule. ... The other type of match is the one used by the second rule (lines 9-17). Here two regular expressions are given using the begin and end keys. The name of the rule will be assigned from where the begin pattern matches to where the end pattern matches (including both matches). If there is no match for the end pattern, the end of the document is used.

In this latter form, the rule can have sub-rules which are matched against the part between the begin and end matches.

Note that the regular expressions are matched against only a single line of the document at a time. That means it is not possible to use a pattern that matches multiple lines.

begin, end — these keys allow matches which span several lines and must both be mutually exclusive with the match key. Each is a regular expression pattern. begin is the pattern that starts the block and end is the pattern which ends the block.

The rules you supplied match text like console.log and highlight the 3 different parts: console, . and log.

'begin': '\\bconsole\\b'
'beginCaptures':
  '0':
    'name': 'entity.name.type.object.console.js'
'end': '(?!\\G)'

Here, console is matched as a whole word, and the whole match (capture group 0) is named entity.name.type.object.console.js. The block then extends to the first position where the end pattern (?!\G) matches, that is, any position that is neither the end of the last successful match nor the beginning of the string. Right after console, \G still matches, so the end pattern fails there. This is necessary to let the nested rules work, i.e. those that match the '\\s*(\\.)\\s*(assert|clear|debug|error|info|log|profile|profileEnd|time|timeEnd|warn)\\s*(?=\\()' pattern. Otherwise, the block would be complete immediately, and the method names would never be matched.
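To make the stopping point concrete, here is a rough single-line model of that behaviour in Python. It is only a sketch of how I understand TextMate-style tokenizers, not the editor's actual code. Python's `re` module has no `\G`, so the end pattern is emulated by comparing the current position with an anchor, and `console_block_span` and `end_matches` are hypothetical names. The handling of the arguments is deliberately crude (it jumps to the first `)`), whereas the real grammar uses the `#arguments` include and the `(?<=\))` end pattern.

```python
import re

BEGIN = re.compile(r"\bconsole\b")
NESTED_BEGIN = re.compile(
    r"\s*(\.)\s*(assert|clear|debug|error|info|log|profile|profileEnd"
    r"|time|timeEnd|warn)\s*(?=\()"
)


def end_matches(pos, anchor):
    # Emulates the end pattern (?!\G): it succeeds at any position
    # except the anchor, i.e. where the previous match ended.
    return pos != anchor


def console_block_span(line):
    # Return (start, end) of the outer "console" block on one line, or None.
    begin = BEGIN.search(line)
    if begin is None:
        return None
    anchor = pos = begin.end()  # \G would match just past "console"
    while pos <= len(line):
        if end_matches(pos, anchor):
            return begin.start(), pos  # the end pattern closes the block here
        nested = NESTED_BEGIN.match(line, pos)
        if nested is None:
            pos += 1  # nothing nested starts here, move one position ahead
            continue
        # Crude stand-in for the nested rule: skip to just after the first ")".
        close = line.find(")", nested.end())
        pos = len(line) if close == -1 else close + 1
    # The line ended while the block was still open (a real tokenizer
    # would carry the rule over to the next line).
    return begin.start(), len(line)


print(console_block_span('console.log("hi")'))  # (0, 17)
print(console_block_span("foo(console)"))       # (4, 12)
```

In the first call the nested rule matches right at the anchor, so the block only closes after the `)`. In the second call nothing nested matches, so the end pattern succeeds one position after `console` and the block closes there.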

Wiktor Stribiżew
  • So if I need to match only `console` which is not followed by any of the nested rules, I would have to create a separate rule for that, right? – Kartik Anand Apr 13 '16 at 12:27
  • You can, possibly with a negative lookahead that would restrict the matches to a `console` that is not followed with a `.` (say, `\\bconsole\\b(?!\\.)`), or you can add the whole subrule expression to the negative lookahead. – Wiktor Stribiżew Apr 13 '16 at 12:30
  • One more doubt: does the last successful match referred to by `(?!\G)` always mean the position just after `console`? I still don't understand how it knows to stop matching at this point. Let's say the nested rules never match; will it never stop? – Kartik Anand Apr 13 '16 at 15:59
  • Yes, right there, after the whole word `console` (that is, at the end of the file (= freshly typed) or immediately followed by a non-word character). And since `console` is matched, and the end of the successful match is right after `e`, the `(?!\G)` does not match there, and the current `console` should be highlighted. Then, when you input a `.`, the `(?!\G)` should match right after the dot. – Wiktor Stribiżew Apr 14 '16 at 10:40
  • One thing I still don't understand is why it wants to match the nested expressions. Aren't nested expressions optional? What I mean is that if I just write console, like this: `foo(console)`, the editor would still want to keep matching the nested expressions. Why can't it stop? – Kartik Anand Apr 14 '16 at 10:52
  • Note that in `foo(console)` the `console` is a whole word, so the rule you showed will trigger. After `\bconsole\b` has matched, the regex index is right after `console`; then `(?!\G)` is run, and it matches one position ahead (since it cannot match at the starting position or at the end of the last successful match) – Wiktor Stribiżew Apr 14 '16 at 11:01
  • So, the editor should stop matching after it encounters the closing bracket `)`, since here the end should match. (sorry for asking a lot of questions!) – Kartik Anand Apr 14 '16 at 11:34
  • Yes, the `)` is the end of the outer block belonging to `console` since the *method* part should end right after a `)` (and potentially arguments, see `include` section). See `(?<=\\))` - *location that is right after a `)`*. So, after matching the `console.log()`, the `(?!\G)` check is run to stop matching as soon as possible after `console.log()` is found. – Wiktor Stribiżew Apr 14 '16 at 11:44
  • Tell me if I am wrong: the editor starts by matching begin. Then it first tries the nested patterns, since otherwise the end would match and the nested patterns would never match. After matching the nested patterns, it tries the end, and if that matches, it stops. But what would happen if, say, begin matches and then no nested patterns match? Would the editor just assume end to be the last non-matched character? – Kartik Anand Apr 14 '16 at 11:49
  • If there is no match with the nested patterns found, the `end` pattern will be tried right at the place where `begin` stopped. And then [a character right after `console` gets matched](https://regex101.com/r/vI4qY9/1) (imagine that `console` is right behind the `.` in the demo). – Wiktor Stribiżew Apr 14 '16 at 11:54
  • See the issue here https://github.com/atom/language-javascript/issues/354 . Here, even though the end should match, still the editor keeps on trying to match the nested expressions and the syntax is broken – Kartik Anand Apr 14 '16 at 11:56
  • Yes, it happens because you added `.bind` that is not on the list of alternations. Try replacing `'begin': '\\s*(\\.)\\s*(assert|clear|debug|error|info|log|profile|profileEnd|time|timeEnd|warn)\\s*(?=\\()'` with `'begin': '\\s*(\\.)\\s*((?:assert|clear|debug|error|info|log|profile|profileEnd|time|timeEnd|warn)(?:\\.bind)?)\\s*(?=\\()'` (**COMMENT UPDATED**) – Wiktor Stribiżew Apr 14 '16 at 12:02
  • Ahh. Now all is clear. So, the issue was never with `(?!\\G)` but with the nested expression. Thanks for the clear explanations! – Kartik Anand Apr 14 '16 at 12:38
  • Glad there was an issue at all that I could help fixing :) – Wiktor Stribiżew Apr 14 '16 at 12:50
  • Sorry to be a snob, but it seems the issue is still not fixed. Even a statement like `foo(console)` is not getting matched properly. Here, there is no bind keyword or anything :/ . Check this -> http://imgur.com/xkbu21v . I guess `(?!\\G)` is not doing its job properly. – Kartik Anand Apr 14 '16 at 12:55
  • Yes, but that already belongs to the `\\bconsole\\b` regex. What do you think the context restriction should be? Try `\\bconsole\\b(?!\\s*,?\\s*(?:"[^\\\\"]*(?:\\\\.[^\\\\"]*)*"|[^"]+)*\\))` – Wiktor Stribiżew Apr 14 '16 at 13:00
  • It does work, but now it doesn't match `console` at all :D, I guess, two rules are needed, one when it is followed by a `.`, and one when it is not followed by a `.` – Kartik Anand Apr 14 '16 at 13:04
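
For anyone who wants to check the patterns from the comments outside the editor, here is a small Python sketch. It only exercises the regexes themselves, not the grammar engine, and it assumes Python's `re` treats these particular patterns the same way Oniguruma does (they use nothing beyond groups, alternation, `\b`, and lookaheads). `method_after_console` is a hypothetical helper that tries the nested pattern at the position right after the first `console`, which is where `\G` would point, and it assumes the text contains `console`.

```python
import re

METHODS = "assert|clear|debug|error|info|log|profile|profileEnd|time|timeEnd|warn"

# Nested begin pattern as it appears in the grammar.
ORIGINAL = re.compile(r"\s*(\.)\s*(" + METHODS + r")\s*(?=\()")
# Variant from the comments: optionally allow a trailing ".bind".
WITH_BIND = re.compile(r"\s*(\.)\s*((?:" + METHODS + r")(?:\.bind)?)\s*(?=\()")
# A bare "console" that is not followed by a dot.
BARE_CONSOLE = re.compile(r"\bconsole\b(?!\.)")


def method_after_console(pattern, text):
    # Try the nested pattern exactly where the "console" match ended.
    start = text.index("console") + len("console")
    m = pattern.match(text, start)
    return m.group(2) if m else None


print(method_after_console(ORIGINAL, "console.log(x)"))              # log
print(method_after_console(ORIGINAL, "console.log.bind(console)"))   # None
print(method_after_console(WITH_BIND, "console.log.bind(console)"))  # log.bind
print(bool(BARE_CONSOLE.search("foo(console)")))                     # True
print(bool(BARE_CONSOLE.search("console.log(x)")))                   # False
```

The last two lines illustrate the point of the final comment: the lookahead version only matches a `console` that is not followed by a `.`, so a grammar would need a separate rule for each case.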