3

How do I build a regular expression that matches all sequences having ABC, DBE, ABE, FBG, and so on, but not XBZ?

My example sequences ABC, DBE, etc. were merely representative. I am not searching for those specific patterns. A, B, C, D, E, etc. can each be any pattern. For example, X, B, and Z can be words.

Specifically, I am looking to find all instances that contain B but are not preceded by X or not followed by Z.

I have come up with a workaround solution using the grep -v option which inverts the matching:

cat file | grep -ne ".*B.*" | grep -ve "XBZ"

But I would rather have a single regular expression.

Arjan
dfernan
  • Why is `XBZ` the odd one out? Please explain. – devnull Jun 27 '13 at 16:32
  • I think this may have been answered in http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word. – Rob Lyndon Jun 27 '13 at 16:33
  • Do you mean "all sequences of three capital letters except XBZ"? Or "all sequences of three capital letters, B being the second one, except XBZ"? – giordano Jun 27 '13 at 16:45
  • Also, why is ABE accepted? Did you mean to type EBF? The pattern you describe seems to be: Select a letter of the alphabet: `H` call it `letterOne`. Take the next letter of the Alphabet: `I`, call it `letterTwo`. Make a string: `letterOne + "B" + letterTwo`. – Shaz Jun 27 '13 at 16:54
  • What does "and so on" mean? – Ed Heal Jun 27 '13 at 20:39
  • Use a regular expression with a capture group, and verify that the result isn't XBZ? – Daniel Jun 27 '13 at 21:45
  • `grep` can work on a file, you don't need `cat | grep`: `grep "[^X]B[^Z]" file` is what you're looking for. – giordano Jun 28 '13 at 12:21
  • @giordano: X and Z can be a combination of any character. The `^` negation within `[]` only works for a single character. – dfernan Jun 28 '13 at 12:34
  • @dfernandes__ sorry, I missed that `X`, `B`, `Z` can be words. – giordano Jun 28 '13 at 12:40

7 Answers

3

It took a while to get there, but this pattern:

(.*((?!X).B|B(?!Z).))|(^B)|(B$)

looks for either (something that is not X)B or B(something that is not Z). The TDD code is as follows:

// Requires: using NUnit.Framework; using System.Text.RegularExpressions;
[Test]
public void TestPattern()
{
    const string pattern = "(.*((?!X).B|B(?!Z).))|(^B)|(B$)";

    Assert.IsFalse(Regex.IsMatch("Hello", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello ABC", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello DBE", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello ABE", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello FBG", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello ABC World", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello DBE World", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello ABE World", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello FBG World", pattern));
    Assert.IsTrue(Regex.IsMatch("ABC World", pattern));
    Assert.IsTrue(Regex.IsMatch("DBE World", pattern));
    Assert.IsTrue(Regex.IsMatch("ABE World", pattern));
    Assert.IsTrue(Regex.IsMatch("FBG World", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello DBE World XBZ", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello ABE World XBZ", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello FBG World XBZ", pattern));
    Assert.IsFalse(Regex.IsMatch("Hello XBZ", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello XB", pattern));
    Assert.IsTrue(Regex.IsMatch("Hello BZ", pattern));
    Assert.IsTrue(Regex.IsMatch("XB Hello", pattern));
    Assert.IsTrue(Regex.IsMatch("BZ Hello", pattern));
    Assert.IsTrue(Regex.IsMatch("B", pattern));
}
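For readers without an NUnit setup, the same check can be sketched in Python. This is only an illustration: the helper name `has_match` is hypothetical, and it assumes Python's `re.search` is an acceptable stand-in for `Regex.IsMatch`.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"(.*((?!X).B|B(?!Z).))|(^B)|(B$)")


def has_match(text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    # A few of the cases from the test method above.
    for sample in ["Hello", "Hello ABC", "Hello XBZ", "Hello XB", "Hello BZ", "B"]:
        print(f"{sample!r}: {has_match(sample)}")
```

Only a subset of the test cases is printed here; the remaining ones can be added to the list in the same way.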
Rob Lyndon
  • The problem is that I do not know the other possible patterns that include B in the middle. A, C, D, E, etc. can be anything. – dfernan Jun 28 '13 at 11:51
  • OK, I get what you're saying. "Specifically, I am looking to find all instances that contain B but are not preceded by X or not followed by Z". I'll update the answer in a couple of minutes. – Rob Lyndon Jun 28 '13 at 12:57
  • Just to clarify, would you expect a match on "B XBZ"? – Rob Lyndon Jun 28 '13 at 13:00
  • I'm expecting the answer to that question to be yes -- the match with B is enough. – Rob Lyndon Jun 28 '13 at 13:09
  • In which case, I think something like .*((!X)B(!Z)) would satisfy your requirements, but I'd like to run it through a TDD harness first. – Rob Lyndon Jun 28 '13 at 13:12
  • Not working. A sample file with `XBZ\nABZ` and `grep -e '.*((!X)B(!Z))' sample` outputs nothing. I've tried with the `-E` (extended regexp) and `-P` (Perl regexp) options and still does not work. – dfernan Jun 28 '13 at 13:30
  • OK, I've got a pattern that gives true for "Hello XB" and "Hello BZ" but false for "Hello XBZ". Posting that now, with the TDD code. – Rob Lyndon Jun 28 '13 at 13:46
  • This definitely seems to be working! Thanks Rob. For using negative lookahead with `grep`, one has to use the Perl regex engine, which is the `-P` switch. More info here: http://stackoverflow.com/questions/9197814/regex-lookahead-for-not-followed-by-in-grep – dfernan Jun 28 '13 at 16:13
2

While regular languages are closed under complement, standard regex syntax has no negation operator. This is purely a syntax problem; nothing prevents a regex engine author from adding a non-standard negation operator to the grammar. So the pattern has to be rewritten as a group of alternatives:

^([^X]..|X[^B].|XB[^Z])$

I don't know of a better way.

P.S. There is a negation operator, ^, that works inside [...], but it negates only a single character. It is used above.
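
A quick way to see what the alternation accepts is to try it on a few strings. The sketch below assumes Python's `re` module and single-character X, B, and Z; the function name `is_three_chars_not_xbz` is made up for illustration.

```python
import re

# Pattern copied from the answer: three characters, differing from "XBZ"
# in at least one position.
THREE_NOT_XBZ = re.compile(r"^([^X]..|X[^B].|XB[^Z])$")


def is_three_chars_not_xbz(text: str) -> bool:
    """Return True for a three-character string other than 'XBZ'."""
    return THREE_NOT_XBZ.search(text) is not None


if __name__ == "__main__":
    for sample in ["ABC", "XBC", "XAZ", "ABZ", "XBZ", "ABCD"]:
        print(sample, is_three_chars_not_xbz(sample))
```

Each alternative handles the first position at which the input may differ from XBZ, which is why the pattern grows with the length of the excluded string.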

monoid
  • This can match lines of infinite length and I assume the OP wants 3 characters total. ABC/*7984.,as matches. – Shaz Jun 27 '13 at 16:50
2

Here is a Perl way to do the job:

use strict;
use warnings;
use feature 'say';

my $re = qr/(?<!X)B(?!Z)/;
while(<DATA>) {
    chomp;
    say /$re/ ? "OK : $_" : "KO : $_";
}
__DATA__
ABC
DBE
ABE
FBG
XBZ

output:

OK : ABC
OK : DBE
OK : ABE
OK : FBG
KO : XBZ

Explanation:

(?-imsx:(?<!X)B(?!Z))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
----------------------------------------------------------------------
    X                        'X'
----------------------------------------------------------------------
  )                        end of look-behind
----------------------------------------------------------------------
  B                        'B'
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    Z                        'Z'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
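
The same lookbehind and lookahead can be tried in Python, whose `re` module accepts this syntax. This is a minimal sketch with a hypothetical helper `classify`. Both assertions must hold for the pattern to match, so under this pattern XBC and ABZ are rejected as well as XBZ.

```python
import re

# B that is not preceded by X and not followed by Z.
B_WITHOUT_X_OR_Z = re.compile(r"(?<!X)B(?!Z)")


def classify(line: str) -> str:
    """Mirror the Perl loop: 'OK' when the pattern matches, otherwise 'KO'."""
    return "OK" if B_WITHOUT_X_OR_Z.search(line) else "KO"


if __name__ == "__main__":
    for line in ["ABC", "DBE", "ABE", "FBG", "XBZ", "XBC", "ABZ"]:
        print(f"{classify(line)} : {line}")
```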
Toto
1

You can do this with a negative lookahead assertion:

(?!^XBZ$)
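
A lookahead on its own consumes no characters, so an unanchored search can succeed with an empty match at a later position. The sketch below is one possible way to use the idea, not necessarily what the answer intended: it assumes the goal is "lines that contain B but are not exactly XBZ", anchors the assertion at the start of the line, and uses a hypothetical helper name.

```python
import re

# Assumption: reject the line only when it is exactly "XBZ",
# and otherwise require a B somewhere in it.
CONTAINS_B_NOT_XBZ = re.compile(r"^(?!XBZ$).*B")


def contains_b_but_is_not_xbz(line: str) -> bool:
    """Return True if line contains B and is not exactly 'XBZ'."""
    return CONTAINS_B_NOT_XBZ.search(line) is not None


if __name__ == "__main__":
    # The bare assertion finds an empty match inside "XBZ" at position 1,
    # because the inner ^ cannot match there.
    print(re.search(r"(?!^XBZ$)", "XBZ") is not None)
    for line in ["ABC", "XBZ", "Hello", "XBZQ"]:
        print(line, contains_b_but_is_not_xbz(line))
```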
Michael Davis
1

I wrote a function to write a regex based on the assumption in my comment. Here are the assumptions:

  • These are three character strings
  • Character one is taken from the alphabet
  • Character two is always the same. In OP's post this is B.
  • Character three is character one + 1.
  • Characters one and three cannot equal character two.

    // Requires: using System; using System.IO; using System.Text;
    static void writeRegex(char skip)
    {
        string mydocpath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
        StringBuilder sb = new StringBuilder();
        sb.Append("^(");
        char one = 'A';
        char two = 'B';
        bool first = true;
        while (one < 'Z' && two <= 'Z')
        {
            // Separate alternatives with "|" after the first one.
            if (!first)
            {
                sb.Append("|");
            }
            first = false;

            // Neither outer character may equal the fixed middle character.
            if (one == skip)
            {
                one++;
            }
            if (two == skip || one == two)
            {
                two++;
            }

            sb.Append(one.ToString() + skip.ToString() + two.ToString());

            one++;
            two++;
        }
        sb.Append(")$");

        // Write the generated pattern to Regex.txt in the Documents folder.
        using (StreamWriter outfile = new StreamWriter(mydocpath + @"\Regex.txt"))
        {
            outfile.Write(sb.ToString());
        }
    }

When given the input of 'B' this produces:

^(ABC|CBD|DBE|EBF|FBG|GBH|HBI|IBJ|JBK|KBL|LBM|MBN|NBO|OBP|PBQ|QBR|RBS|SBT|TBU|UBV|VBW|WBX|XBY|YBZ)$

There is no negation, only brute force of all acceptable constructions of the three characters.
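
For anyone who wants to check the generated alternation without a C# toolchain, here is a rough Python port. The name `build_alternation` is hypothetical, and unlike the original it returns the pattern instead of writing it to Regex.txt.

```python
import re


def build_alternation(skip: str) -> str:
    """Build the brute-force alternation with `skip` as the middle character."""
    one, two = ord("A"), ord("B")
    middle = ord(skip)
    parts = []
    while one < ord("Z") and two <= ord("Z"):
        # Neither outer character may equal the fixed middle character.
        if one == middle:
            one += 1
        if two == middle or one == two:
            two += 1
        parts.append(chr(one) + skip + chr(two))
        one += 1
        two += 1
    return "^(" + "|".join(parts) + ")$"


if __name__ == "__main__":
    pattern = build_alternation("B")
    print(pattern)
    print(re.search(pattern, "ABC") is not None)
    print(re.search(pattern, "XBZ") is not None)
```

Under these assumptions ABE is not in the generated set, so the pattern covers only the consecutive-letter reading of the question.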

Shaz
1

The notation used by W3C for specifying XML or XQuery has the - operator for exclusion, and it can be very handy to have it available. See for example this rule for (case-insensitively) excluding the word "XML":

PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

A DFA-based regular expression engine could easily support this kind of exclusion by making use of the fact that regular languages are closed under difference. Yet you don't find it implemented very often.

One parser/lexer generator that has it is REx, using W3C notation. It will go open-source at some point, but I need more time to supply some missing bits, most notably documentation.

Using that notation, your example could look something like this:

Letter ::= [A-Z]
Three-Letter-Code ::= (Letter Letter Letter) - 'XBZ'
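
In engines without a difference operator, the same rule can be approximated with a negative lookahead. The following is only a sketch in Python, not REx syntax, and the helper name is made up for illustration.

```python
import re

# Emulates (Letter Letter Letter) - 'XBZ': three capital letters,
# with a lookahead that rules out the excluded code.
THREE_LETTER_CODE = re.compile(r"(?!XBZ)[A-Z]{3}")


def is_three_letter_code(text: str) -> bool:
    """Return True if text is exactly three capital letters other than 'XBZ'."""
    return THREE_LETTER_CODE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["ABC", "XBA", "XBZ", "abc", "ABCD"]:
        print(sample, is_three_letter_code(sample))
```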
Gunther
1

I think people are overthinking this question. If I understand the question correctly - that you want the regex to match a set of specific sequences, but not some other specific sequence - the answer is simply that you don't have to tell a regex what not to match. It matches only what fits the pattern you specify, and nothing else. ABC|DBE|ABE|FBG matches ABC or DBE or ABE or FBG, and doesn't match any other sequence, including XBZ. You don't have to specifically instruct it not to match XBZ.
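
A minimal illustration of that point, assuming Python's `re` module and a hypothetical helper name:

```python
import re

# Only the listed sequences are described; nothing else needs to be excluded.
SPECIFIC_SEQUENCES = re.compile(r"ABC|DBE|ABE|FBG")


def contains_listed_sequence(text: str) -> bool:
    """Return True if text contains one of the four listed sequences."""
    return SPECIFIC_SEQUENCES.search(text) is not None


if __name__ == "__main__":
    for sample in ["ABC", "Hello FBG World", "XBZ"]:
        print(sample, contains_listed_sequence(sample))
```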

Adi Inbar