0

I've read many questions on Stack Overflow, including this one, this one, and even Rexegg's Best Trick, which is also covered in a question here. I found this one, which works on entire lines, but not on "everything up to the bad word". None of these have helped me, so here I go:

In JavaScript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, as follows:

1 UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.

2 UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.

3 UniquePrefixA [some-token] is using [some-token] to [some-token].

Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter a prefix again during capture, I should not capture the second occurrence, and STOP THERE. I'll have captured everything up to that prefix.

If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.

I'm also using capture groups (not repeating ones, since a repeated capture group only returns the last match of that group). The capture group contents need to be returned, so I'm using match, non-greedy.
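
For illustration, here is a small sketch of that limitation. The tokens [a], [b] and [c] are made up here (they are not part of my real data) so that the iterations can be told apart:

```javascript
// Hypothetical input: distinct tokens so each iteration is distinguishable.
const input = '[a] [b] [c]';

// One capture group inside a repeated group: only the final iteration survives.
const repeated = /(?:(\[\w+\])\s*)+/.exec(input);
console.log(repeated[1]); // "[c]"

// Spelling each group out explicitly keeps every token.
const explicit = /(\[\w+\])\s*(\[\w+\])?\s*(\[\w+\])?/.exec(input);
console.log(explicit.slice(1)); // [ "[a]", "[b]", "[c]" ]
```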

Here's my pattern and a working example:

/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i

It's basically 2 repeating patterns in a specific order:

(\s|[^\[\]])*     // Basically .*, but excluding brackets
(\[some-token\])  // A token [some-token]

How can I prevent the match from continuing past a blacklist of words?

I want this to happen where I drew the arrows, for context. The equivalent of "any character, but not the contents of this list": (UniquePrefixA|UniqueB|UniqueC) (as seen in the first, non-capturing group of the pattern).

It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'd like to know whether a negative lookahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.

Community
Atomox

2 Answers

1

I think an easier-to-maintain solution is to divide your task into two parts:

  1. Find each chunk of text starting from any of your unique prefixes, up to the next prefix or the end of the string.

  2. Process each such chunk, looking for your [some-token]s and maybe also the content between them.

The regex performing the first task should include 3 parts:

  • (?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking for any unique prefix.
  • ((?:.|\n)+?) - A capturing group - the fragment to catch for further processing (see the note below).
  • (?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking for either any unique prefix or the end of the string (a stop criterion you are looking for).

To sum up, the whole regex looks like below:

/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi

Note: When this answer was written, the JavaScript flavour of regex did not implement the single-line (s) option; the s (dotAll) flag was only added in ES2018. So, instead of just . in the capturing group above, the regex uses (?:.|\n), meaning:

  • either any char other than \n (.),
  • or just \n.

Both alternatives are wrapped in a non-capturing group to delimit the two sides of |, so that the repetition quantifier (+?) applies to the alternation as a whole.

Note the ? after +, which makes the quantifier reluctant (lazy).

So this part of the regex (the capturing group) will match any sequence of chars including \n, ending before the next unique prefix (if any), just as you expect.
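
As a rough sketch of this first step, the regex above can be run in a loop over the second sample sentence from the question. The variable names (chunkPattern, text, chunks) are only illustrative:

```javascript
// The regex from this answer: prefix, lazily captured body, lookahead stop.
const chunkPattern =
  /(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi;

// Sample text taken from the question (sentence 2).
const text =
  'UniqueC [some-token] wants to take [some-token] to the store. ' +
  'UniqueB, [some-token] is in the pattern once more.';

// Collect group 1 of every match; the g flag advances lastIndex each time.
const chunks = [];
let m;
while ((m = chunkPattern.exec(text)) !== null) {
  chunks.push(m[1]);
}

console.log(chunks);
// [ " [some-token] wants to take [some-token] to the store. ",
//   ", [some-token] is in the pattern once more." ]
```

The lookahead does not consume the next prefix, so the following match can start exactly there.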

The second task is to apply another regex to the captured chunk (group 1), looking for [some-token]s and possibly the content between them. You didn't specify exactly what you want to do with each chunk, so I'm not sure what this second regex should include. Maybe it will be enough just to match [some-token]?
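
A minimal sketch of what the second step might look like, assuming it is enough to collect the tokens and the text between them. The chunk value is the first chunk of sentence 2 from the question; the names tokens and between are made up for this example:

```javascript
// One chunk as produced by the first step (prefix already stripped).
const chunk = ' [some-token] wants to take [some-token] to the store. ';

// All tokens in the chunk; match() returns null when nothing matches.
const tokens = chunk.match(/\[some-token\]/g) || [];

// The text around the tokens, trimmed of surrounding whitespace.
const between = chunk.split(/\[some-token\]/).map((part) => part.trim());

console.log(tokens.length); // 2
console.log(between);       // [ "", "wants to take", "to the store." ]
```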

Valdi_Bo
  • This is definitely a well thought out post, and might help get me going. I've started responding two or three times, and each one gets me thinking. – Atomox Jan 04 '18 at 18:18
  • Consider: ( A . . . B . . . C, (GATE) then also D . . . E). ABC has a specific pattern. However, DE is so common that there will be many false matches when not attached to DE. But what you are saying is a sort of ```split('/(reserve words)/')```, then perform the pattern match (without stop words) on the entire pattern for each chunk? I'm not 100% sure that will work, (I'm mulling over my stop words, and if I can split on them out of context or not). Perhaps I need to refine my example. – Atomox Jan 04 '18 at 18:23
  • Yes, the first regex is something like *split*, but "ordinary" split would require to drop the first fragment (before the 1st unique pattern). – Valdi_Bo Jan 04 '18 at 20:24
0

To ensure that a pattern does not occur within a repeated character sequence such as (\s|[^\[\]])* (note that \s is already covered by [^\[\]], so this could be just [^\[\]]*), prepend a negative lookahead (a zero-length assertion, like ^) inside the repeated group, to the left of the character match, so that it is checked before every character:

((?!UniquePrefixA)(\s|[^\[\]]))*
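
A sketch of how the same idea might extend to several stop words, assuming the alternatives are simply listed inside the lookahead. The helper names (stop, filler, pattern) and the shortened sample text are illustrative, not taken from the question:

```javascript
// The list of prefixes that must stop the match.
const stop = '(?:UniquePrefixA|UniqueB|UniqueC)';

// Any non-bracket char, as long as no stop word starts at that position.
const filler = `(?:(?!${stop})[^\\[\\]])*`;

// Prefix, first token, then the guarded filler (captured as group 2).
const pattern = new RegExp(`${stop}\\s*(\\[some-token\\])(${filler})`, 'i');

const text = 'UniqueC [some-token] wants to go. UniqueB, [some-token] again.';
const m = pattern.exec(text);

console.log(m[0]); // "UniqueC [some-token] wants to go. "
console.log(m[2]); // " wants to go. "
```

The lookahead is evaluated before each character is consumed, so the repetition ends just before the next prefix.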
Nahuel Fouilleul
  • Can you show me an example with more than one Option? I'm specifically looking for something like ```((?!UniquePrefixA|PrefixB|UniqueC)(\s|[^\[\]]))*```. I have already seen examples where this would work with only a single word, as you demonstrate above. Also, thanks! – Atomox Jan 04 '18 at 17:50
  • Also, good tip on `\s*` being included via the [^] group. Thanks! – Atomox Jan 04 '18 at 18:25