Purely for academic reasons, I would like to present the regex solution, too, mostly because you are probably using the only regex engine that is capable of solving this.
After clearing up some interesting issues about the combination of .NET's unique features, here is the code that gets you the desired results:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string> { "homo sapiens", "human", "man", "woman" };
// build subpattern "(?:homo sapiens|human|man|woman)"
string searchAlternation = "(?:" + String.Join("|", checkList.ToArray()) + ")";
MatchCollection matches = Regex.Matches(
    mainString,
    @"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)" + searchAlternation,
    RegexOptions.IgnoreCase
);
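To try the snippet end to end, it can be wrapped into a small console program. This is only a sketch: the class name `Demo` and the helper `FindMarked` are hypothetical names introduced here, and the `Regex.Escape` call is an added assumption (so that check-list entries containing regex metacharacters are taken literally), not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class Demo
{
    // Same lookbehind as in the snippet above; the word alternation is appended to it.
    const string Lookbehind =
        @"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)";

    // Hypothetical helper: returns every check-list word that is marked by "~"
    // or sits anywhere inside a "~( ... )" group.
    public static List<string> FindMarked(string text, IEnumerable<string> words)
    {
        // Assumption: escape each word so regex metacharacters are matched literally.
        string alternation =
            "(?:" + string.Join("|", words.Select(w => Regex.Escape(w))) + ")";

        return Regex.Matches(text, Lookbehind + alternation, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    static void Main()
    {
        string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
        var checkList = new List<string> { "homo sapiens", "human", "man", "woman" };

        // Prints "Homo Sapiens", "human" and "woman"; the stand-alone "man" is skipped.
        foreach (string hit in FindMarked(mainString, checkList))
            Console.WriteLine(hit);
    }
}
```

Note that the matched text keeps its original casing, because `RegexOptions.IgnoreCase` only affects the comparison.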
Now how does this work? Firstly, .NET supports balancing groups, which allow for detection of correctly nested patterns. Every time we capture something with a named capturing group (like (?<Depth>somepattern)), it does not overwrite the last capture, but is instead pushed onto a stack. We can pop one capture from that stack with (?<-Depth>). This will fail if the stack is empty (just like something that does not match at the current position). And we can check whether the stack is empty or not with (?(Depth)patternIfNotEmpty|patternIfEmpty).
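To see the three building blocks (push, pop, and the emptiness check) in isolation, here is a minimal sketch that only tests whether the parentheses in a string are balanced. It runs in the ordinary left-to-right direction, so it pushes on "(" and pops on ")". The names `BalancingDemo` and `IsBalanced` are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancingDemo
{
    // (?<Depth>[(])   pushes a capture for every "("
    // (?<-Depth>[)])  pops one capture for every ")" and fails if the stack is empty
    // (?(Depth)(?!))  fails at the end if anything is still left on the stack
    static readonly Regex Balanced = new Regex(
        @"^(?>[^()]+|(?<Depth>[(])|(?<-Depth>[)]))*(?(Depth)(?!))$");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }

    static void Main()
    {
        Console.WriteLine(IsBalanced("a(b(c)d)e")); // True
        Console.WriteLine(IsBalanced("a(b(c)d"));   // False: one "(" is never closed
        Console.WriteLine(IsBalanced("a)b("));      // False: ")" arrives on an empty stack
    }
}
```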
In addition to that, .NET has one of the very few regex engines that support unrestricted variable-length lookbehinds. If we can use these two features together, we can look to the left of one of our desired strings and see whether there is a ~( somewhere outside the current nesting structure.
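As a small illustration of a variable-length lookbehind on its own, the following sketch extracts a number that follows "ID:" and an arbitrary amount of whitespace. The input format and the names `LookbehindDemo` and `NumberAfterId` are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // The lookbehind contains \s*, so the text it has to cover has no fixed length.
    public static string NumberAfterId(string input)
    {
        return Regex.Match(input, @"(?<=ID:\s*)\d+").Value;
    }

    static void Main()
    {
        Console.WriteLine(NumberAfterId("ID:42"));       // 42
        Console.WriteLine(NumberAfterId("ID:    1337")); // 1337
    }
}
```

Engines that only allow fixed-length lookbehinds reject a pattern like this one at compile time.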
But here is the catch (see the link above). Lookbehinds are executed from right to left in .NET, which means that we need to push closing parens and pop on encountering opening parens, instead of the other way round.
So here is some explanation of that murderous regex (it's easier to understand if you read the lookbehind from bottom to top, just like .NET would do):
(?<=                  # lookbehind
  ~                   # if there is a literal ~ to the left of our string, we're good
  |                   # OR
  (?(Depth)(?!))      # if there is something left on the stack, we started outside
                      # of the parentheses opened by "~(", so fail
  ~[(]                # match a literal ~(
  (?>                 # subpattern to analyze parentheses. the > makes the group
                      # atomic, i.e. suppresses backtracking. Note: we can only do
                      # this, because the three alternatives are mutually exclusive
    [^()]+            # consume any non-parens characters without caring about them
    |                 # OR
    (?<-Depth>)?      # pop the top of stack IF possible. the last ? is necessary for
                      # cases like "human" where we start with a ( before there was a )
                      # which could be popped.
    [(]               # match a literal (
    |                 # OR
    (?<Depth>[)])     # match a literal ) and push it onto the stack
  )*                  # repeat for as long as possible
)                     # end of lookbehind
(?:homo sapiens|human|man|woman)
                      # match one of the words in the check list