2

I have been trying to write a code that will check if the given string contains certain strings with certain pattern. To be precise, for example:

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman"
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};

Now, I want to extract

"homo sapiens", "human" and "woman" but NOT "man"

from the above list as they follow the pattern, i.e string followed by~ or one of the strings inside parenthesis that starts with ~. So far I have come up with:

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman"
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
var prunedList = new List<string>();
foreach(var term in checkList)
{
   var pattern = @"~(\s)*(\(\s*)?(\(?\w\s*\)?)*" + term + @"(\s*\))?";
   Match m = Regex.Match(mainString, pattern);
   if(m.success)
   {
      prunedList.Add(term);
   }
 }

But this pattern is not working for all cases... Can any one suggest me how this can be done?

5 Answers5

2

I wrote a simple parser that works well for the example you gave.

I don't know what the expected behavior is for a string that ends in this pattern: ~(some words (ie, no closing parenthesis with valid opening)

I'm sure you could clean this up some...

private bool Contains(string source, string given)
{
    return ExtractValidPhrases(source).Any(p => RegexMatch(p, given));
}

private bool RegexMatch(string phrase, string given)
{
    return Regex.IsMatch(phrase, string.Format(@"\b{0}\b", given), RegexOptions.IgnoreCase);
}

private IEnumerable<string> ExtractValidPhrases(string source)
{
    bool valid = false;
    var parentheses = new Stack<char>();
    var phrase = new StringBuilder();

    for(int i = 0; i < source.Length; i++)
    {
        if (valid) phrase.Append(source[i]);

        switch (source[i])
        {
            case '~':
                valid = true;
                break;

            case ' ':
                if (valid && parentheses.Count == 0)
                {
                    yield return phrase.ToString();
                    phrase.Clear();
                }
                if (parentheses.Count == 0) valid = false;
                break;

            case '(':
                if (valid)
                {
                    parentheses.Push('(');
                }
                break;

            case ')':
                if (valid)
                {
                    parentheses.Pop();
                }
                break;
        }
    }

    //if (valid && !parentheses.Any()) yield return phrase.ToString();
    if (valid) yield return phrase.ToString();
}

Here are the tests I used:

// NUnit tests
[Test]
[TestCase("Homo Sapiens", true)]
[TestCase("human", true)]
[TestCase("woman", true)]
[TestCase("man", false)]
public void X(string given, bool shouldBeFound)
{
    const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";

    Assert.AreEqual(shouldBeFound, Contains(mainString, given));
}

[Test]
public void Y()
{
    const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
    var checkList = new List<string> {"homo sapiens", "human", "man", "woman"};
    var expected = new List<string> { "homo sapiens", "human", "woman" };

    var filtered = checkList.Where(s => Contains(mainString, s));

    CollectionAssert.AreEquivalent(expected, filtered);
}
Austin Salonen
  • 46,038
  • 15
  • 101
  • 135
2

The language of balanced parenthesis is not regular and as a result you cannot accomplish what you want using RegEx. A better approach would be to use traditional string parsing with a couple of counters - one for open paren and one for close parens - or a stack to create a model similar to a Push Down Automaton.

To get a better idea of the concept check out PDA's on Wikipedia. http://en.wikipedia.org/wiki/Pushdown_automaton

Below is an example using a stack to get strings inside the out most parens (pseudo code).

 Stack stack = new Stack();
 char[] stringToParse = originalString.toCharArray();

 for (int i = 0; i < stringToParse.Length; i++)
 {
      if (stringToParse[i] == '(')
            stack.push(i);
      if (stringToParse[i] == ')')
         string StringBetweenParens = originalString.GetSubstring(stack.pop(), i);
 }

Now of course this is a contrived example and would need some work to do more serious parsing, but it gives you the basic idea of how to do it. I've left out things like; the correct function names (don't feel like looking them up right now), how to get text in nested parens like getting "inner" out of the string "(outer (inner))" (that function would return "outer (inner)"), and how to store the strings you get back.

evanmcdonnal
  • 38,588
  • 13
  • 84
  • 107
2

Simply for academic reasons, I would like to present the regex solution, too. Mostly, because you are probably using the only regex engine that is capable of solving this.

After clearing up some interesting issues about the combination of .NET's unique features, here is the code that gets you the desired results:

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string> { "homo sapiens", "human", "man", "woman" };

// build subpattern "(?:homo sapiens|human|man|woman)"
string searchAlternation = "(?:" + String.Join("|", checkList.ToArray()) + ")";

MatchCollection matches = Regex.Matches(
    mainString,
    @"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)"+searchAlternation,
    RegexOptions.IgnoreCase
);

Now how does this work? Firstly, .NET supports balancing groups, which allow for detection of correctly nested patterns. Every time we capture something with a named capturing group (like (?<Depth>somepattern)) it does not overwrite the last capture, but instead is pushed onto a stack. We can pop one capture from that stack with (?<-Depth>). This will fail, if the stack is empty (just like something that does not match at the current position). And we can check whether the stack is empty or not with (?(Depth)patternIfNotEmpty|patternIfEmpty).

In addition to that, .NET has the only regex engine that supports variable-length lookbehinds. If we can use these two features together, we can look to the left of one of our desired strings and see whether there is a ~( somewhere outside the current nesting structure.

But here is the catch (see the link above). Lookbehinds are executed from right to left in .NET, which means that we need to push closing parens and pop on encountering opening parens, instead of the other way round.

So here is for some explanation of that murderous regex (it's easier to understand if you read the lookbehind from bottom to top, just like .NET would do):

(?<=              # lookbehind
  ~               # if there is a literal ~ to the left of our string, we're good
|                 # OR
  (?(Depth)(?!))  # if there is something left on the stack, we started outside
                  # of the parentheses that end end "~("
  ~[(]            # match a literal ~(
  (?>             # subpattern to analyze parentheses. the > makes the group
                  # atomic, i.e. suppresses backtracking. Note: we can only do
                  # this, because the three alternatives are mutually exclusive
    [^()]+        # consume any non-parens characters without caring about them
  |               # OR
    (?<-Depth>)?  # pop the top of stack IF possible. the last ? is necessary for
                  # like "human" where we start with a ( before there was a )
                  # which could be popped.
    [(]           # match a literal (
  |               # OR
    (?<Depth>[)]) # match a literal ) and push it onto the stack
  )*              # repeat for as long as possible
)                 # end of lookbehind
(?:homo sapiens|human|man|woman)
                  # match one of the words in the check list
Community
  • 1
  • 1
Martin Ender
  • 40,690
  • 9
  • 78
  • 120
1

This is not possible using regular expressions. You should abandon idea of using them and use normal string operations like IndexOf.

Euphoric
  • 12,015
  • 1
  • 26
  • 41
  • 1
    Albeit ugly, it is definitely possible with regex engines as powerful as .NET's (might be possible with PCRE, too). – Martin Ender Nov 14 '12 at 21:27
  • Ya, I think I should abandon the idea of using regex.. the other idea i have is to trim space between ~ and the following word, and break the sentence based on spaces, and for the broken words starting with ~, check if the words in list is present in there. – user1801163 Nov 14 '12 at 21:32
1

Paranthesis checking is a context-free language or grammar which requires a stack for checking. Regular expressions are suitable for regular languages. They do not have memory, therefore they cannot be used for such purposes.

To check this you need to scan the string and count the parentheses:

  • initialize count to 0
  • scan the string
    • if current character is ( then increment count
    • if current character is ) then decrement count
    • if count is negative, raise an error that parentheses are inconsistent; e.g., )(
  • In the end, if count is positive, then there are some unclosed parenthesis
  • If count is zero, then the test is passed

Or in C#:

public static bool CheckParentheses(string input)
{
    int count = 0;
    foreach (var ch in input)
    {
        if (ch == '(') count++;
        if (ch == ')') count--;

        // if a parenthesis is closed without being opened return false
        if(count < 0)
            return false;
    }

    // in the end the test is passed only if count is zero
    return count == 0;
}

You see, since regular expressions are not capable of counting, then they cannot check such patterns.

Sina Iravanian
  • 15,001
  • 4
  • 28
  • 44