
In my JavaScript app I have this random string:

büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)

and I would like to match all words (including special characters and numbers) except the words AND, OR and NOT.

This is what I tried:

/(?!AND|OR|NOT)\b[\u00C0-\u017F\w\d]+/gi
which results in
["büert", "3454jhadf", "asdfsdf", "technüology", "bar", "bas"]

but this does not match the ü, or any other letter outside the a-z alphabet, at the beginning or at the end of a word because of the \b word boundary.

Removing the \b oddly ends up matching parts of the words I would like to exclude:

/(?!AND|OR|NOT)[\u00C0-\u017F\w\d]+/gi
result is
["büert", "ND", "OT", "3454jhadf", "üasdfsdf", "R", "technüology", "ND", "bar", "R", "bas"]

What is the correct way to match all words, no matter what type of characters they contain, except the ones I want to exclude?

aschmid00
  • Closely related: [Javascript RegExp + Word boundaries + unicode characters](http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters) – apsillers Jan 07 '16 at 13:23
  • The `(?:^|\\s)` solution in the linked question above will exclude all matches after/before punctuation and in similar contexts. Negated character class should be used to ensure the same functionality as with `\b`. – Wiktor Stribiżew Jan 07 '16 at 13:41

1 Answer

The issue here has its roots in the fact that \b (and \w, and other shorthand classes) are not Unicode-aware in JavaScript.
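
A minimal sketch of that behavior, using only short sample strings chosen for illustration: \w covers just [A-Za-z0-9_], so ü counts as a non-word character and \b only fires where a \w character meets a non-\w character.

```javascript
// \w is ASCII-only in JavaScript: [A-Za-z0-9_]
var asciiIsWord = /^\w$/.test("a");   // true
var umlautIsWord = /^\w$/.test("ü");  // false

// \b requires a \w character on exactly one side of the position.
// Between a space and "ü" both sides are non-word, so there is no boundary:
var boundaryBeforeUmlaut = /\bü/.test(" üa");  // false
// Between "ü" (non-word) and "a" (word) there IS a boundary,
// which is why the first attempt in the question returns "asdfsdf":
var boundaryAfterUmlaut = /ü\b/.test(" üa");   // true
```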

Now, there are 2 ways to achieve what you want.

1. SPLIT WITH PATTERN(S) YOU WANT TO DISCARD

var re = /\s*\b(?:AND|OR|NOT)\b\s*|[()]/;
var s = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";
var res = s.split(re).filter(Boolean);
document.body.innerHTML += JSON.stringify(res, null, 4);
// => [ "büert", "3454jhadf üasdfsdf", "technüology", "bar", "bas" ]

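Note the use of a non-capturing group (?:...): with a capturing group, split would include the unwanted words in the resulting array. Also, you need to add all punctuation and other unwanted characters to the character class. As the output shows, words separated only by whitespace ("3454jhadf üasdfsdf") stay together in one item, because whitespace alone is not part of the split pattern.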
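
If each word should end up as its own item, one possible variation (a sketch, not the only way to do it) is to add \s to the character class and quantify it, so that runs of whitespace and parentheses also act as separators:

```javascript
// Same split pattern as above, but whitespace is added to the character
// class so that words separated only by spaces are split as well.
var re = /\s*\b(?:AND|OR|NOT)\b\s*|[()\s]+/;
var s = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";
var words = s.split(re).filter(Boolean); // filter(Boolean) drops empty strings
// words => ["büert", "3454jhadf", "üasdfsdf", "technüology", "bar", "bas"]
```

The \b around the keywords still works here because AND, OR and NOT consist of ASCII letters only; there is no i flag, so only the upper-case keywords are treated as separators.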

2. MATCH USING CUSTOM BOUNDARIES

You can use groups with anchors and negated character classes in a regex like this:

(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)

Capture group 2 will hold the values you need.

See regex demo

JS code demo:

var re = /(^|[^\u00C0-\u017F\w])(?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$))([\u00C0-\u017F\w]+)(?=[^\u00C0-\u017F\w]|$)/gi; 
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m;
var arr = []; 
while ((m = re.exec(str)) !== null) {
  arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);

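Or build the regex dynamically from blocks to make it more readable (note that this version passes only the g flag, so the AND/OR/NOT exclusion is case-sensitive):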

var bndry = "[^\\u00C0-\\u017F\\w]";
var re = RegExp("(^|" + bndry + ")" +                   // starting boundary
           "(?!(?:AND|OR|NOT)(?=" + bndry + "|$))" +    // restriction
           "([\\u00C0-\\u017F\\w]+)" +                  // match and capture our string
           "(?=" + bndry + "|$)"                        // set trailing boundary
           , "g"); 
var str = 'büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)';
var m, arr = []; 
while ((m = re.exec(str)) !== null) {
  arr.push(m[2]);
}
document.body.innerHTML += JSON.stringify(arr);
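
The same dynamic construction could be wrapped in a reusable function. The sketch below is an illustration only: the names buildTermRegex and extractTerms are hypothetical, and it assumes the excluded words are passed as an array of plain strings (they are escaped before being put into the pattern).

```javascript
// Hypothetical helper: builds the custom-boundary regex for a list of
// excluded words. Regex metacharacters in the words are escaped.
function buildTermRegex(excludedWords) {
  var chars = "\\u00C0-\\u017F\\w";
  var bndry = "[^" + chars + "]";
  var escaped = excludedWords.map(function (w) {
    return w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  return new RegExp(
    "(^|" + bndry + ")" +                                    // leading boundary
    "(?!(?:" + escaped.join("|") + ")(?=" + bndry + "|$))" + // restriction
    "([" + chars + "]+)" +                                   // the word itself
    "(?=" + bndry + "|$)",                                   // trailing boundary
    "g");
}

// Hypothetical helper: collects capture group 2 of every match.
function extractTerms(str, excludedWords) {
  var re = buildTermRegex(excludedWords);
  var m, terms = [];
  while ((m = re.exec(str)) !== null) {
    terms.push(m[2]);
  }
  return terms;
}

var terms = extractTerms(
  "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)",
  ["AND", "OR", "NOT"]);
// terms => ["büert", "3454jhadf", "üasdfsdf", "technüology", "bar", "bas"]
```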

Explanation:

  • (^|[^\u00C0-\u017F\w]) - our custom leading boundary (capture group 1): matches either the start of the string (^) or any character that is not in the [\u00C0-\u017F\w] set
  • (?!(?:AND|OR|NOT)(?=[^\u00C0-\u017F\w]|$)) - a restriction on the match: the match fails if the next word is AND, OR or NOT, i.e. one of these followed by the end of the string or by a character that is neither a word character nor in the \u00C0-\u017F range
  • ([\u00C0-\u017F\w]+) - matches and captures into group 2 one or more word characters ([a-zA-Z0-9_]) or characters from the \u00C0-\u017F range
  • (?=[^\u00C0-\u017F\w]|$) - the trailing boundary: either the end of the string ($) or a character that is neither a word character nor in the \u00C0-\u017F range
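
As a side note, a different technique from the one explained above: engines that support ES2018 Unicode property escapes and lookbehind (an assumption about the runtime, since older browsers lack both) allow the same custom boundaries to be written without a capture group and without a hard-coded code point range. A sketch, where wordChar and termRe are names chosen here for illustration:

```javascript
// Assumes an ES2018+ engine: \p{...} requires the u flag, and (?<!...)
// is a lookbehind. Letters, digits and underscore count as word characters.
const wordChar = "[\\p{L}\\p{N}_]";
const termRe = new RegExp(
  "(?<!" + wordChar + ")" +                   // leading boundary: no word char before
  "(?!(?:AND|OR|NOT)(?!" + wordChar + "))" +  // skip AND/OR/NOT as whole words
  wordChar + "+",                             // the word itself
  "gu");

const str = "büert AND NOT 3454jhadf üasdfsdf OR technüology AND (bar OR bas)";
// normalize("NFC") keeps letters such as ü as single code points, so that
// \p{L} covers them even if the input arrived in decomposed form.
const found = str.normalize("NFC").match(termRe) || [];
// found => ["büert", "3454jhadf", "üasdfsdf", "technüology", "bar", "bas"]
```

Because the boundaries are lookarounds here, nothing outside the word is consumed and the whole match is the word, so match can be used directly.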
Wiktor Stribiżew
  • This does not look readable due to the restriction to skip AND/OR/NOT. Perhaps, you could use blocks to build the RegExp dynamically to make it more readable. – Wiktor Stribiżew Jan 07 '16 at 13:31
  • I think this is too complicated regex and you should add the `split` solution. – Tushar Jan 07 '16 at 14:01
  • Or close as a dupe then. Still there is a difference between these 2 solutions. Although this regex might look monstrous, it is actually a valid way of defining custom boundaries in JS. – Wiktor Stribiżew Jan 07 '16 at 14:06