This one is a little complicated and somewhat out of my league. I want to go through a list of words and eliminate those that contain anything other than a specific set of characters. Those characters can be in any order, and some may occur more often than others.

I want the regex to look for any words with:

e 0 or 1 times
a 0 or 1 times
t 0 or 1 or 2 times

For example, the following would work:

eat tea tate tt a e

The following would not work:

eats teas tates ttt aa ee

Lookaround Regex is new to me, so I'm not 100% sure on the syntax (any answer using a lookaround with an explanation would be awesome). My best guess so far:

Regex regex = new Regex(@"(?=.*e)(?=.*a)(?=.*t)");
lines = lines.Where(x => regex.IsMatch(x)).ToArray(); // 'lines' is an array containing words
NealR
  • I don't think that a Regex can do what you want (please correct me if I'm wrong!). Perhaps you could split the text into words, and then count the number of each characters that appear in each word? Then if the sum of the counts you've done isn't equal to the length of the word, it must mean that there are extra illegal characters floating around. – starbeamrainbowlabs Nov 21 '15 at 17:40
  • That's a possibility. I think that regex can handle something like this... however I could be wrong as well. – NealR Nov 21 '15 at 17:43
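
The counting idea from the comment above can be sketched without any regex. This is a minimal sketch, not the approach the answers take: the class name `LetterBudget`, the helper `IsAllowed`, and the `Limits` table are hypothetical names introduced here, and the comparison is assumed to be case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetterBudget
{
    // Maximum occurrences allowed per letter (limits taken from the question).
    private static readonly Dictionary<char, int> Limits = new Dictionary<char, int>
    {
        ['e'] = 1,
        ['a'] = 1,
        ['t'] = 2,
    };

    // Hypothetical helper: true when the word is non-empty, uses only the
    // listed letters, and stays within each letter's limit.
    public static bool IsAllowed(string word)
    {
        if (string.IsNullOrEmpty(word)) return false;

        var counts = new Dictionary<char, int>();
        foreach (char c in word)
        {
            if (!Limits.TryGetValue(c, out int max)) return false; // illegal character
            counts.TryGetValue(c, out int seen);                   // seen is 0 when c is new
            if (seen + 1 > max) return false;                      // too many of this letter
            counts[c] = seen + 1;
        }
        return true;
    }

    public static void Main()
    {
        string[] lines = { "eat", "tea", "tate", "tt", "a", "e",
                           "eats", "teas", "tates", "ttt", "aa", "ee" };
        string[] kept = lines.Where(IsAllowed).ToArray();
        Console.WriteLine(string.Join(" ", kept)); // eat tea tate tt a e
    }
}
```

Rejecting a word as soon as an unknown character or an over-limit letter appears avoids the separate "sum of counts versus word length" comparison, but it checks the same condition.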

2 Answers

3

Sure:

\b(?:e(?!\w*e)|t(?!(?:\w*t){2})|a(?!\w*a))+\b

Explanation:

\b             # Start of word
(?:            # Start of group: Either match...
 e             # an "e",
 (?!\w*e)      # unless another e follows within the same word,
|              # or
 t             # a "t",
 (?!           # unless...
  (?:\w*t){2}  # two more t's follow within the same word,
 )             # 
|              # or
 a             # an "a"
 (?!\w*a)      # unless another a follows within the same word.
)+             # Repeat as needed (at least one letter)
\b             # until we reach the end of the word.

Test it live on regex101.com.

(I've used the \w character class for simplicity's sake; if you want to define your allowed "word characters" differently, replace it accordingly.)

Tim Pietzcker
  • This is it, isn't it :D or with [little modification](https://regex101.com/r/qP0jQ6/1) @NealR. – bobble bubble Nov 21 '15 at 18:34
  • The OP's `words` is actually a `string[]`. So you do not have to care about how to match *words*, only whole strings. – Wiktor Stribiżew Nov 21 '15 at 18:38
  • The only problem with Tim's regex is that he used `*` instead of `+`, thus allowing it to match empty strings. (See those vertical lines around all the other words in the demo? Which I [updated](https://regex101.com/r/mK9rH7/3), BTW) The word boundaries work just as well as anchors because valid "words" are always composed entirely of letters. – Alan Moore Nov 21 '15 at 18:43
  • For some reason this almost works but not quite. For example - `didn't` `doesn't` and `don't` are getting through. Is it because of the apostrophe? – NealR Nov 22 '15 at 00:02
  • @bobblebubble - your regex works. The only difference I notice is the `?` character before each letter in the group; however, again, I could be missing something. Example: `(?!.*?e)` vs `(?!\w*e)` – NealR Nov 22 '15 at 00:22
  • @NealR: Yes, `\w` matches letters, digits and the underscore - as I said, it's a simplification. You can create a [character class](http://www.regular-expressions.info/charclass.html) that contains the allowed characters, for example `[A-Za-z']` or (if you want to allow non-ASCII letters) `[\p{L}']` etc. instead of `\w`. The only requirement the regex makes is that every "word" must start and end with an actual letter. Otherwise, the [word boundary anchors](http://www.regular-expressions.info/wordboundaries.html) won't work correctly. – Tim Pietzcker Nov 22 '15 at 08:46
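
Since each element of the array is a single word, the pattern can also be anchored to the whole string, as suggested in the comments. The sketch below is an adaptation, not Tim's exact regex: it swaps `\b` for `^` and `$` and `\w*` for `.*`, and the class name `WholeStringFilter` and the sample words are made up for illustration. It also shows why `didn't` gets through the `\b` version: the trailing `t` after the apostrophe is a valid "word" on its own.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeStringFilter
{
    // The answer's pattern, unchanged: matches a valid word anywhere in the input.
    public static readonly Regex WordBounded = new Regex(
        @"\b(?:e(?!\w*e)|t(?!(?:\w*t){2})|a(?!\w*a))+\b");

    // Adapted variant (an assumption of this sketch): the whole string must match.
    public static readonly Regex Anchored = new Regex(
        @"^(?:e(?!.*e)|t(?!(?:.*t){2})|a(?!.*a))+$");

    public static string[] Filter(string[] lines) =>
        lines.Where(x => Anchored.IsMatch(x)).ToArray();

    public static void Main()
    {
        // The lone "t" after the apostrophe satisfies \b...\b.
        Console.WriteLine(WordBounded.IsMatch("didn't")); // True
        Console.WriteLine(Anchored.IsMatch("didn't"));    // False

        string[] lines = { "eat", "tate", "ttt", "didn't", "don't", "a" };
        Console.WriteLine(string.Join(" ", Filter(lines))); // eat tate a
    }
}
```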
1

This is probably the same as the others; I haven't formatted those to find out.

Note that assertions must succeed for the match to proceed; they can't be optional
(unless explicitly made optional, but what for?) and are not directly affected by backtracking.

This works; the explanation is in the formatted regex.

Update: to use a whitespace boundary, use this:

(?<!\S)(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+(?!\S)

Formatted:

 (?<! \S )
 (?!
      \w* 
      (?: e \w* ){2}
 )
 (?!
      \w* 
      (?: a \w* ){2}
 )
 (?!
      \w* 
      (?: t \w* ){3}
 )
 [eat]+ 
 (?! \S )
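
In .NET, a spaced-out layout like this can be passed to the engine directly with `RegexOptions.IgnorePatternWhitespace`, which ignores unescaped whitespace in the pattern and allows `#` comments. The following is a minimal sketch assuming the words arrive as one whitespace-separated string; the class name `VerbosePattern`, the helper `Extract`, and the sample text are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VerbosePattern
{
    // Whitespace-boundary version, laid out one assertion per line.
    private const string Pattern = @"
        (?<! \S )                    # no non-whitespace character just before
        (?! \w* (?: e \w* ){2} )     # assert: not 2 e's in this word
        (?! \w* (?: a \w* ){2} )     # assert: not 2 a's in this word
        (?! \w* (?: t \w* ){3} )     # assert: not 3 t's in this word
        [eat]+                       # consume the letters e, a or t
        (?! \S )                     # no non-whitespace character just after
    ";

    public static readonly Regex Words =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns every word in the text that passes the checks.
    public static string[] Extract(string text) =>
        Words.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string text = "eat tea tate tt a e eats teas tates ttt aa ee don't";
        Console.WriteLine(string.Join(" ", Extract(text))); // eat tea tate tt a e
    }
}
```

Because `(?<!\S)` and `(?!\S)` require whitespace or the string edge on both sides, the `t` in `don't` is no longer treated as a word of its own.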

To use an ordinary word boundary, use this:

\b(?!\w*(?:e\w*){2})(?!\w*(?:a\w*){2})(?!\w*(?:t\w*){3})[eat]+\b

Formatted:

 \b                     # Word boundary
 (?!                    # Lookahead, assert Not 2 'e' s
      \w* 
      (?: e \w* ){2}
 )
 (?!                    #  Lookahead, assert Not 2 'a' s
      \w* 
      (?: a \w* ){2}
 )
 (?!                    #  Lookahead, assert Not 3 't' s
      \w* 
      (?: t \w* ){3}
 )
 # At this point all the checks have passed;
 # all that's left is to match the letters.
 # -------------------------------------------------

 [eat]+                 # 1 or more of these, Consume letters 'e' 'a' or 't'
 \b                     # Word boundary
  • I get the same result as the regex below - words like `didn't` `doesn't` and `don't` get through (although this filters the exact same amount of words as the other answer). Again, stretching the boundaries of my regex knowledge here, but could it be because of the apostrophe? – NealR Nov 22 '15 at 00:03
  • @NealR - Added a whitespace boundary one too. That should stop those `do's` and `don't` from being matched. –  Nov 22 '15 at 18:50