7

I am trying to search for patterns in a 2D matrix represented as a string. Picture the following:

// horizontal line
String pat1 =
    "............." +
    "............." +
    "............." +
    "....XXXX....." +
    "............." +
    ".............";

// vertical line
String pat2 =
    "............." +
    "......X......" +
    "......X......" +
    "......X......" +
    "......X......" +
    ".............";

Searching for the first pattern would be trivial, the regex would be something like:

X+

In the second case, it is a little trickier but doable since I know the number of columns and rows of the matrix:

(X.{`WIDTH - 1`})+
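For a concrete width, that pattern can be built and tried as below. This is only a minimal sketch: `VerticalLine`, `verticalRun` and `firstRunHeight` are names made up for illustration, and the matrix is assumed to be a single string with no line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VerticalLine {
    // Builds (X.{WIDTH - 1})+ for a concrete width.
    static Pattern verticalRun(int width) {
        return Pattern.compile("(X.{" + (width - 1) + "})+");
    }

    // Returns how many rows the first matched run spans, or 0 if there is no match.
    // Each repetition consumes exactly `width` characters (one X plus width - 1 others).
    static int firstRunHeight(String matrix, int width) {
        Matcher m = verticalRun(width).matcher(matrix);
        return m.find() ? (m.end() - m.start()) / width : 0;
    }
}
```

Note that every X in the run must be followed by `WIDTH - 1` characters, so an X too close to the end of the string is not counted.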

Where I ran into problems was coming up with a regex that recognizes the following patterns:

// fixed but unknown number of columns
String pat3 =
    "............." +
    ".....XXX....." +
    ".....XXX....." +
    ".....XXX....." +
    ".....XXX....." +
    ".............";

// variable number of columns
String pat4 =
    "............." +
    ".....XXX....." +
    "....XXXXX...." +
    "...XXXXXXX..." +
    ".....XXX....." +
    ".............";

What I am looking for is a way to create a regex pattern equivalent to:

(X.{`WIDTH - PREVCOUNT`})+

Where PREVCOUNT is the length of the last matched pattern (I am aware that I would be missing the first X of the 4th line in pat4, but I can live with that). I know that there are lookaheads in regex, but I wonder if what I am trying to achieve is possible at all. Even if it was possible, I also worry about the performance hit of using lookaheads since I don't fully understand how they work internally.

Is there a way of doing this with a single regex validation, or do I have to search row by row and then try to see if the X's are all contiguous?

Edit: As a clarification, I am trying to search for "blobs" of X's. As long as there are contiguous X's across columns/rows it can be considered as belonging to a blob. A few examples:

String blob1 =
    "............." +
    "......XX....." +
    "....XXXX....." +
    "...XXXXX....." +
    ".....XXX....." +
    ".............";

String blob2 =
    "............." +
    ".....XXX....." +
    "....XXXXX...." +
    "...XXXXXXX..." +
    "....XXXXX...." +
    ".....XXX.....";


String blob3 =
    "............." +
    ".....XXX....." +
    ".....XXX....." +
    ".....XXX....." +
    "............." +
    ".............";


String notblob =
    "............." +
    "..XXX........" +
    "......XXX...." +
    "..XXX........" +
    "............." +
    ".............";

My solution does not need to be exact, hence why I am trying to use a probably lousy regex approach.

Oscar Wahltinez
  • Can you please specify your programming language? Thanks. – Michael Simbirsky Nov 03 '13 at 05:05
  • I have been using Java – Oscar Wahltinez Nov 03 '13 at 05:34
  • Not sure what result you are looking for or what you are trying to achieve. Could you post an example of the output of the regex? Are you looking for the index position of each X sequence? or the length of each X sequence? One thing I would like to point out is that there are no columns in your string because you have no linebreak characters: in spite of your code formatting, it's all one line. – Sylverdrag Nov 03 '13 at 07:29
  • Even though it is a single-line String, it represents a 2D matrix with a known number of columns and rows. I'm trying to find "blobs" of a certain pattern (in this case, represented by X). I will edit the question to clarify a bit more – Oscar Wahltinez Nov 03 '13 at 07:32
  • @omtinez Do you need to do this using only regex? – The Guy with The Hat Nov 13 '13 at 00:29
  • No, but I thought that it would have been an interesting and fast approach – Oscar Wahltinez Nov 13 '13 at 05:47
  • Maybe you are interested in a similar question about [matching of "vertical" rows](http://stackoverflow.com/questions/17039670/vertical-regex-matching-in-an-ascii-image) (not blobs) – stema Dec 16 '13 at 15:07

3 Answers

2

This is not solvable using regular expressions.

Basically, you define a matrix as such:

0^k1 X^l1 0^m1
0^k2 X^l2 0^m2
0^k3 X^l3 0^m3

000XX000
 ^  ^ ^
 k  l m

Where, 0^a means "character '0' repeated a times",
k stands for repetitions of 0 before X
l stands for repetitions of X
m stands for repetitions of 0 after X
ki + li + mi = row_width, for any i

Now, your blob criterion is this:

mi + k(i+1) < row_width
ki + m(i+1) < row_width
these two conditions should meet for any i

Regular languages cannot match such a pattern because they have no memory for counting, so there is no regular-expression solution to your problem.


A proper solution would count connected components to determine how many separate blobs there are.
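A minimal sketch of such a count, assuming the matrix is a row-major string of known width, that only 'X' cells belong to a blob, and that cells are connected horizontally or vertically (not diagonally). The class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BlobCounter {
    // Counts groups of 'X' cells that touch horizontally or vertically.
    static int countBlobs(String matrix, int width) {
        int rows = matrix.length() / width;
        boolean[] seen = new boolean[matrix.length()];
        int blobs = 0;
        for (int start = 0; start < rows * width; start++) {
            if (matrix.charAt(start) != 'X' || seen[start]) continue;
            blobs++; // found a cell of a component not visited yet
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            seen[start] = true;
            while (!stack.isEmpty()) { // iterative flood fill
                int cell = stack.pop();
                int r = cell / width, c = cell % width;
                int[][] neighbours = {{r - 1, c}, {r + 1, c}, {r, c - 1}, {r, c + 1}};
                for (int[] n : neighbours) {
                    if (n[0] < 0 || n[0] >= rows || n[1] < 0 || n[1] >= width) continue;
                    int next = n[0] * width + n[1];
                    if (matrix.charAt(next) == 'X' && !seen[next]) {
                        seen[next] = true;
                        stack.push(next);
                    }
                }
            }
        }
        return blobs;
    }
}
```

A result of 1 means all X's form a single blob; anything larger means the X's are split into separate groups. Each cell is visited once, so the cost is linear in the size of the matrix.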

aec
  • You're right, but please see my answer below. If you *really* like regex, you can do it that way. – geert3 Dec 20 '13 at 12:36
1

One elegant solution I think would be to first suppress all single-X sequences, both horizontally and vertically e.g.:

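String blob = ".....";  // the flattened matrix
// replaceAll returns a new String, so the result has to be kept
String cleaned = blob
    .replaceAll("([^X])X([^X])", "$1.$2")             // X with no X to its left or right
    .replaceAll("([^X].....)X(.....[^X])", "$1.$2");  // X with no X above or below; "....." stands for WIDTH - 1 characters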

Then all remaining sequences of at least 2 Xes are blobs. Note that to overcome the same issue mentioned by sdanzig, you should first "expand" the blob with a "border" of non-Xes.
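The border and the suppression step could look roughly like the sketch below. It is a variant of the idea rather than the exact chained `replaceAll` calls: it uses lookarounds so that an X is cleared only when none of its four neighbours is an X, and it pads each row with one extra '.' column as the "border". `IsolatedCells`, `padRows` and `clearIsolated` are hypothetical names.

```java
import java.util.regex.Pattern;

class IsolatedCells {
    // Appends one '.' to every row so the last cell of a row is never
    // adjacent (in the flat string) to the first cell of the next row.
    static String padRows(String matrix, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < matrix.length(); i += width) {
            sb.append(matrix, i, Math.min(i + width, matrix.length())).append('.');
        }
        return sb.toString();
    }

    // Replaces every X that has no X directly left, right, above or below.
    // Expects a matrix already padded to paddedWidth columns.
    static String clearIsolated(String padded, int paddedWidth) {
        String gap = ".{" + (paddedWidth - 1) + "}";
        Pattern lonely = Pattern.compile(
            "(?<!X)(?<!X" + gap + ")X(?!X)(?!" + gap + "X)", Pattern.DOTALL);
        return lonely.matcher(padded).replaceAll(".");
    }
}
```

Because lookarounds consume no characters, neighbouring isolated X's are all handled in a single pass, and the checks always see the original input rather than a partially replaced one.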

geert3
  • This can be a valid solution. I'm only responding because you said this was a regex solution; it is not, it's Java. It can help OP – aec Dec 20 '13 at 13:24
0

I think I grok what you're trying to do here. The "prevcount" you define isn't enough information to match the pattern. You have to take into account the "next width" in order to determine the number of dots to check. However, I'm not sure if you're really validating even the trivial pattern. X+ will match 5 X's in a row too. And in your second pattern, the first or last line could be two X's, and you wouldn't detect that.

That said, here's a way to provide similar validation with pat3:

(X{3}.{`WIDTH-3`})+

I probably broke another taboo, by repeating the X pattern, but you need to do that in order to keep the repeating pattern in line with the "X-block"'s starts and stops.

pat4 is even trickier. There's no real way to preserve order of your validations checking one line at a time. You could do this:

(X{3}.{`WIDTH-4`}|X{5}.{`WIDTH-6`}|X{5}.{`WIDTH-6`}|X{3}.{`WIDTH-5`})+

But then you'd be vulnerable to validating a matrix with the rows switched around, and the dots changed on each side of the X-blocks to accommodate. However, you could try checking all the lines at once:

(X{3}.{`WIDTH-4`}X{5}.{`WIDTH-6`}X{5}.{`WIDTH-6`}X{3}.{`WIDTH-5`})

And that would not incur any extra performance hit. It would perhaps even be more efficient, because you only pay the overhead of compiling the pattern and starting a match once.
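One way to avoid working out the gaps by hand is to derive the all-lines-at-once regex from a template matrix. The sketch below does that by run-length encoding everything between the first and last X; `TemplateRegex` and `fromTemplate` are illustrative names, and the example numbers assume pat4 exactly as printed in the question (rows of 3, 5, 7 and 3 X's, width 13).

```java
import java.util.regex.Pattern;

class TemplateRegex {
    // Run-length encodes the template between its first and last X:
    // runs of X become X{n}, runs of anything else become .{n}.
    static Pattern fromTemplate(String template) {
        int first = template.indexOf('X');
        int last = template.lastIndexOf('X');
        if (first < 0) {
            throw new IllegalArgumentException("template contains no X");
        }
        StringBuilder regex = new StringBuilder();
        int i = first;
        while (i <= last) {
            boolean isX = template.charAt(i) == 'X';
            int j = i;
            while (j <= last && (template.charAt(j) == 'X') == isX) {
                j++;
            }
            regex.append(isX ? "X" : ".").append('{').append(j - i).append('}');
            i = j;
        }
        return Pattern.compile(regex.toString());
    }
}
```

For that pat4 the generated pattern is `X{3}.{9}X{5}.{7}X{7}.{8}X{3}`. Since it only encodes relative offsets, it also finds the same shape shifted to another column, but because `.` matches X as well, it does not check that the gaps are actually empty.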

Trivial side note: if the matrix is a multi-line string, using its width as-is won't work. You need to add one to account for the newline character, and you need to make sure your "." matches the newline character as well. In Java, you can use Pattern.DOTALL for this.
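A small sketch of that adjustment for the vertical-line case, assuming rows are separated by a single '\n' (not "\r\n"). `MultiLineVertical` and `hasVerticalRun` are made-up names.

```java
import java.util.regex.Pattern;

class MultiLineVertical {
    // True if at least minHeight X's are stacked in one column.
    // Vertically adjacent cells are width + 1 characters apart because of '\n',
    // so each X is followed by `width` arbitrary characters, newline included.
    static boolean hasVerticalRun(String multiLine, int width, int minHeight) {
        Pattern p = Pattern.compile(
            "(X.{" + width + "}){" + (minHeight - 1) + "}X", Pattern.DOTALL);
        return p.matcher(multiLine).find();
    }
}
```

Without `Pattern.DOTALL` the `.` would stop at the line break and the pattern could never span two rows.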

sdanzig
  • I noticed the mistake on the multi-line string, my actual input does not really have that hence why I missed the +1. If I understand your answer correctly, you are interpreting my patterns literally, but the number of rows/columns that contain X's are not predictable. In other words, I used `X+` because a line could have 4, 5, or `WIDTH` X's that I'm trying to match; the same applies to all the other patterns. With that in mind, I don't think that any answer with a hardcoded number could be the solution to my problem – Oscar Wahltinez Nov 03 '13 at 07:23
  • I'd suggest making your question less ambiguous, to save others similarly misspent time :) – sdanzig Nov 03 '13 at 07:33
  • My apologies, thank you for taking the time to answer. I added a clarification to my question, please let me know if that is not sufficient to make the question more understandable – Oscar Wahltinez Nov 03 '13 at 07:41