3

I'm trying to write a regex that matches a set of words.

For example, say I am matching the word set American Tea.

Then in the string American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea there should be only 2 matches:

'American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea'

So I am trying to match only whole occurrences of the word set.

I tried some approaches but haven't found the correct regex :( If anyone can help or point me in the right direction, it would be really helpful.

Check this

'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea'.match(/(?:^|\s)(American Tea)(?=\s|$)/g)

The result of this is ["American Tea", " American Tea"]

I do not want the space in the second match; I want the match result to be ["American Tea", "American Tea"]

(no space in front of the second American Tea)
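One way to get the matches without the leading space is to read capture group 1 instead of the whole match. Below is a minimal sketch using the regex from the question; the helper name findWordSet and the shape of the returned objects are made up for illustration. It also records the start index of the phrase itself, which may be handy for highlighting.

```javascript
// Collect capture group 1 from every match, so the whitespace consumed
// by (?:^|\s) never shows up in the results.
function findWordSet(text, pattern) {
  const results = [];
  let m;
  // The pattern must carry the g flag, otherwise exec() never advances.
  while ((m = pattern.exec(text)) !== null) {
    // m.index is the start of the whole match (possibly the space).
    // Group 1 sits at the end of the match, so the length difference
    // gives the offset of the phrase itself.
    const start = m.index + (m[0].length - m[1].length);
    results.push({ text: m[1], index: start });
  }
  return results;
}

const sample = 'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea';
const found = findWordSet(sample, /(?:^|\s)(American Tea)(?=\s|$)/g);
console.log(found.map(function (r) { return r.text; }));
// [ 'American Tea', 'American Tea' ]
```

String.prototype.match with the g flag only returns whole matches, which is why the space appears there; exec() exposes the groups for each match.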

Roko C. Buljan
ghostCoder
  • So, you want 3 matches or 2? What space issue do you mean? A leading space? Show the code, and it will be clearer what you are up to. In general, in JS, you have to use *capturing* when you need to use both lookbehind and lookahead. Like `(^|\s)(American Tea)(?=$|\s)` here. – Wiktor Stribiżew Jan 07 '16 at 20:37
  • what i have is /(?:^|\s)(American Tea)(?=\s|$)/g but it has a space issue – ghostCoder Jan 07 '16 at 20:42
  • It does not have any issues. The issue is **how** you are using it. A regex is poor in JS (poorer than in PHP, .NET, Java, etc), but the language has all what it needs to make up for it. Without the code, the question is impossible to answer. – Wiktor Stribiżew Jan 07 '16 at 20:43
  • edited the question to add a little more detail – ghostCoder Jan 07 '16 at 20:47
  • i know the word set that i need to match in the string that i have. im using this in my textbox highlighter, to highlight the usage of words as the user types. so if he types a wordset like 'American Tea', i want to match it and highlight it. so i dont want to highlight #American Tea – ghostCoder Jan 07 '16 at 20:50
  • I suspected that. Please post the *function* that highlights the words. I guess all you want is to use backreferences correctly. – Wiktor Stribiżew Jan 07 '16 at 20:50
  • @stribizhev yes i need the indices as i need to replace the matched word with something to highlight the typed words – ghostCoder Jan 07 '16 at 20:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/100090/discussion-between-stribizhev-and-ghostcoder). – Wiktor Stribiżew Jan 07 '16 at 21:12

4 Answers

2

Use .replace() for fun and profit

/(?:^|\s)(american tea)/ig

https://regex101.com/r/qB0uO2/1

if you want to account for prefixes AND suffixes:

/(?:^|\s)(american tea)(?:\W|$)/ig 

https://regex101.com/r/qB0uO2/2

JSBIN EXAMPLE

var str = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";

str.replace(/(?:^|\s)(american tea)(?:\W|$)/ig, function(i, m){
  console.log(m);
});

//"American Tea"
//"American Tea"

EDIT:

The above only logs the matches. If you instead want to preserve the matched prefixes and suffixes, use capturing groups for them as well:

var str = "American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea";

var newStr = str.replace(/(^|\s)(american tea)(\W|$)/ig, function(im, p1, p2, p3){
  return  p1 +"<b>"+ p2 +"</b>"+ p3; // p1 and p3 will help preserve the pref/suffix
});

document.getElementById("result").innerHTML = newStr;
<div id="result"></div>

where the parts

  • p1 is the first matching group (any prefix)
  • p2 is the second matching group (the "American Tea" word)
  • p3 is the third matching group (any suffix)
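A variation worth considering: because (\W|$) consumes the character after the phrase, two occurrences separated by a single space share that space, and the second one is skipped. The sketch below swaps the suffix group for a lookahead so nothing after the phrase is consumed. The function name highlightPhrase and the escaping step are my own additions, not part of the answer above.

```javascript
function highlightPhrase(text, phrase) {
  // Escape regex metacharacters so the phrase is matched literally.
  const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Prefix is still captured (and restored below); the suffix is only
  // checked with a lookahead, so it stays available for the next match.
  const re = new RegExp('(^|\\s)(' + escaped + ')(?=\\W|$)', 'ig');
  return text.replace(re, function (whole, prefix, word) {
    return prefix + '<b>' + word + '</b>';
  });
}

console.log(highlightPhrase('American Tea American Tea', 'American Tea'));
// <b>American Tea</b> <b>American Tea</b>
```

With the three-group pattern, the same input would only get the first occurrence wrapped.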
Roko C. Buljan
  • @stribizhev yes that's exatly what OP wants. To match unprefixed American Tea – Roko C. Buljan Jan 07 '16 at 20:58
  • This will give the matches as ["American Tea", " American Tea"] what i need is ["American Tea", "American Tea"] (no space in the match) – ghostCoder Jan 07 '16 at 21:04
  • @ghostCoder see the jsbin example. The results are totally correct. – Roko C. Buljan Jan 07 '16 at 21:05
  • but 'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea'.match(/(?:^|\s)(american tea)/ig) gives ["American Tea", " American Tea"] – ghostCoder Jan 07 '16 at 21:12
  • same issue. i used 'American Tea lalalal qwqwqw American Tea sdsdsd #American Tea'.replace(/(?:^|\s)(American Tea)(?:\W|$)/g, function(i, m){ return 'wow'; }); the result i got was - 'wow lalalal qwqwqwwow sdsdsd #American Tea' – ghostCoder Jan 07 '16 at 21:25
  • the spaces next to the second 'American Tea' got replaced too. – ghostCoder Jan 07 '16 at 21:26
  • I will check tomorrow when I finally get some sleep this week. – Wiktor Stribiżew Jan 07 '16 at 21:26
  • 1
    @ghostCoder please.... http://jsbin.com/cilosi/1/edit?html,css,js,console,output see the console? You see any errors cause I don't. You're making issues, not my answer :) – Roko C. Buljan Jan 07 '16 at 21:32
  • @RokoC.Buljan can you please check this https://jsfiddle.net/hy812kgr/ This has the same expression as the one you gave – ghostCoder Jan 07 '16 at 21:37
  • i want to replace the text American Tea with wow without removing the spaces next to the word. thats what i am trying to achieve here. – ghostCoder Jan 07 '16 at 21:38
  • @ghostCoder edited my answer to add what you asked for (keep the matching prefixes and suffixes). See now. – Roko C. Buljan Jan 07 '16 at 22:03
0

Reading the comments, I realized that a regex might not be the best solution for this. However, it is pretty interesting to see how you can work around the fact that JavaScript does not support positive lookbehind, which would make this task easy.

If JS had the (?<=...) construct, then you would just use a positive lookbehind and a positive lookahead and list all the characters which you want to allow to the left and right of American Tea. So what we want is something like this:

(?<=\s|\.|,|:|;|\?|\!|^)American Tea(?=\s|\.|,|:|;|\?|\!|$)

To the left, you would allow any of the listed characters and the start of the string ^. To the right, you allow the same characters and the end of the string $.

But Javascript does not have the (?<=...) construct. So we will have to get a little creative:

(?=(\s|\.|,|:|;|\?|\!|^))\1(American Tea)(?=\s|\.|,|:|;|\?|\!|$)

This regex replaces the positive lookbehind with a positive lookahead that captures the leading character (or the empty string at the start of the input) in group 1. It then consumes that same text with \1, and finally American Tea ends up in capturing group 2.

Demo: https://regex101.com/r/qX9qR3/3
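To make the group numbering concrete, here is a small sketch that runs the workaround regex and reads the phrase from group 2. For comparison it also runs the lookbehind version: engines implementing ES2018 or later do accept (?<=...), so this second half assumes such an engine (a recent browser or Node.js) and will throw a syntax error elsewhere. The variable names are only for illustration.

```javascript
const text = 'American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea';

// Lookahead-plus-backreference workaround: group 1 holds the boundary
// character (or '' at the start of the string), group 2 holds the phrase.
const workaround = /(?=(\s|\.|,|:|;|\?|!|^))\1(American Tea)(?=\s|\.|,|:|;|\?|!|$)/g;
const phrases = [];
let m;
while ((m = workaround.exec(text)) !== null) {
  phrases.push(m[2]);
}
console.log(phrases); // [ 'American Tea', 'American Tea' ]

// Same boundaries with a real lookbehind (assumes an ES2018+ engine).
// Nothing outside the phrase is consumed, so match() is enough.
const withLookbehind = /(?<=\s|\.|,|:|;|\?|!|^)American Tea(?=\s|\.|,|:|;|\?|!|$)/g;
const direct = text.match(withLookbehind);
console.log(direct); // [ 'American Tea', 'American Tea' ]
```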

timgeb
0

You don't need regexes to match words.

I know a very neat CoffeeScript snippet :

wordList = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"]
tweet = "This is an example tweet talking about javascript and stuff."

wordList.some (word) -> ~tweet.indexOf word # returns true

Which compiles into the following javascript :

var tweet, wordList;

wordList = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"];

tweet = "This is an example tweet talking about javascript and stuff.";

wordList.some(function(word) { // returns true
  return ~tweet.indexOf(word); 
});

~ is not a special operator in CoffeeScript, just a cool trick. It is the bitwise NOT operator, which inverts the bits of its operand; for an integer x, ~x equals -x-1. Here it works because we want to check for an index greater than -1: indexOf returns -1 when nothing is found, and -(-1)-1 is 0, which is falsy, while every other index maps to a nonzero (truthy) value.
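A quick sketch of what the operator does to the values indexOf can return, next to the more explicit comparison it stands in for. The names viaTilde and viaCompare are only for illustration.

```javascript
console.log(~-1); // 0  (falsy: "not found")
console.log(~0);  // -1 (truthy: found at index 0)
console.log(~7);  // -8 (truthy)

const wordList = ["coffeescript", "eko", "talking", "play framework", "and stuff", "falsy"];
const tweet = "This is an example tweet talking about javascript and stuff.";

// The tilde trick and the explicit comparison select the same words.
const viaTilde = wordList.filter(function (word) {
  return ~tweet.indexOf(word);
});
const viaCompare = wordList.filter(function (word) {
  return tweet.indexOf(word) !== -1;
});
console.log(viaTilde);   // [ 'talking', 'and stuff' ]
console.log(viaCompare); // [ 'talking', 'and stuff' ]
```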

If you want the words that are matched, use :

wordList.filter (word) -> ~tweet.indexOf word # returns : [ "talking", "and stuff" ]

Or the same in JS :

wordList.filter(function(word) { // returns : [ "talking", "and stuff" ]
  return ~tweet.indexOf(word);
});
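Keep in mind that a plain indexOf test also accepts embedded occurrences such as WowAmerican Tea or #American Tea. If you want to stay regex-light but still respect boundaries, something along these lines might do. This is only a sketch: the function name findWholePhrase is made up, and the boundary rule (whitespace or start before, non-word character or end after) is my assumption about what the question needs.

```javascript
function findWholePhrase(text, phrase) {
  const hits = [];
  if (phrase.length === 0) return hits; // avoid looping forever on ''
  let from = 0;
  let at;
  while ((at = text.indexOf(phrase, from)) !== -1) {
    const before = at === 0 ? '' : text[at - 1];
    const after = text[at + phrase.length] || '';
    // Assumed rule: start-of-string or whitespace before the phrase,
    // end-of-string or a non-word character after it.
    const leftOk = before === '' || /\s/.test(before);
    const rightOk = after === '' || /\W/.test(after);
    if (leftOk && rightOk) hits.push(at);
    from = at + phrase.length;
  }
  return hits;
}

const question = 'American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea';
const positions = findWholePhrase(question, 'American Tea');
console.log(positions); // [ 0, 37 ]
```

The returned indices point at the phrase itself, so they can be used directly for replacing or wrapping the text.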
Jeremy Thille
0

While Jeremy is of course right, I assume there is more to your problem than visible in your contrived example.

From the looks of it, you want regular regex word boundaries, except that you consider "#" to be a word character. In that case you can do something like this (where \b means "word boundary"):

(^|[^#])\bAmerican Tea\b

Or, if you simply want to list the characters that you consider non word characters you can do something like this to simulate word boundaries:

(^|[^A-Za-z])American Tea($|[^A-Za-z])
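For what it's worth, here is how the first pattern could be used for replacing. This is a sketch rather than a drop-in solution: I added a second capture group around the phrase so that the character matched by (^|[^#]) can be put back, and the name wrapMatches is made up.

```javascript
// Group 1: start of string or any character other than '#'.
// Group 2: the phrase itself (added here for illustration).
const hashAware = /(^|[^#])\b(American Tea)\b/g;

function wrapMatches(text) {
  // $1 restores the character that preceded the phrase.
  return text.replace(hashAware, '$1<b>$2</b>');
}

const input = 'American Tea is awesome. Do you like American Tea? love WowAmerican Tea #American Tea';
const output = wrapMatches(input);
console.log(output);
// <b>American Tea</b> is awesome. Do you like <b>American Tea</b>? love WowAmerican Tea #American Tea
```

WowAmerican Tea is skipped because there is no word boundary between "w" and "A", and #American Tea is skipped because the character before the phrase is "#".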

You can play around e.g. at http://www.regexr.com/

Martin Rauscher