Split string of varying length using regex

Question

I don't know if this is possible using regex. I'm just asking in case someone knows the answer.

I have a string s = "hellohowareyou??". I need to split it like this:

[h, el, loh, owar, eyou?, ?].

The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.

public ArrayList<String> splitString(String s)
{
    int cnt = 0, i;
    ArrayList<String> sList = new ArrayList<String>();
    for (i = 0; i + cnt < s.length(); i = i + cnt)
    {
        cnt++;
        sList.add(s.substring(i, i + cnt));
    }
    sList.add(s.substring(i, s.length()));
    return sList;
}

I was just curious whether such a thing can be done using regex.

Sounds to me like "Can someone show me, how to make it complicated?" ;-) — splash, Sep 10 '10 at 06:23
It doesn't matter if you make it complicated.I'm just learning the possibilities of regex thats all. — Emil, Sep 10 '10 at 06:27
I personally and sincerely thank you for this problem. I had lots of fun trying to solve it, learned a lot in the process, and I also used forward references for the first time. I've revised my answer several times now with solutions that I think gets better each time. I may revisit this problem again in the future if I come up with an even more elegant solution. — polygenelubricants, Sep 11 '10 at 14:09
@poly:It's actually me who should be thanking you.In-fact most of my questions answers were answered by you and I have become a huge fan of yours.I would like to ask you one thing if you don't mind saying.What is this 'polygenelubricants'? — Emil, Sep 12 '10 at 05:07
In Java, `"polygenelubricants".hashCode()` is `Integer.MIN_VALUE`. In two's complement representation, the most negative value is unique for not having a positive counterpart. That is, in Java, `Math.abs(Integer.MIN_VALUE)` is a negative number. Credit for the string goes to Josh Bloch who used it in one of his puzzlers. — polygenelubricants, Sep 12 '10 at 06:53

score 9 · Accepted Answer · edited Jun 20 '20 at 09:12

Solution

The following snippet generates the pattern that does the job (see it run on ideone.com):

// splits at indices that are triangular numbers
class TriangularSplitter {
 
  // asserts that the prefix of the string matches pattern
  static String assertPrefix(String pattern) {
    return "(?<=(?=^pattern).*)".replace("pattern", pattern);
  }
  // asserts that the entirety of the string matches pattern
  static String assertEntirety(String pattern) {
    return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
  }
  // repeats an assertion as many times as there are dots behind current position
  static String forEachDotBehind(String assertion) {
    return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
  }
  
  public static void main(String[] args) {
    final String TRIANGULAR_SPLITTER =
      "(?x) (?<=^.) | measure (?=(.*)) check"
        .replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
        .replace("notGyet", assertPrefix("(?! \\1 \\G)"))
        .replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
        .replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
        .replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
        ;
    String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    System.out.println(
        java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
    );
    // [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
  }
}

Note that this solution uses techniques already covered in my regex article series. The only new things here are \G and forward references.

References

This is a brief description of the basic regex constructs used:

(?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
| denotes alternation (i.e. "or").
? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
(?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.
(?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
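
For readers who want to try these constructs in isolation, here is a small sketch. The class name, method names and sample strings are made up for illustration and are not part of the solution above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class RegexConstructsDemo {

    // (?<=...) with ^: a zero-width match right after the first character
    static String[] afterFirstChar(String s) {
        return s.split("(?<=^.)");
    }

    // (?<=...) with \G: split two characters after the end of the previous match
    static String[] pairs(String s) {
        return s.split("(?<=\\G..)");
    }

    // (?=...): split in front of each digit without consuming it
    static String[] beforeDigits(String s) {
        return s.split("(?=\\d)");
    }

    // (?!...): replace "foo" only where it is not followed by "bar"
    static String fooNotBar(String s) {
        return s.replaceAll("foo(?!bar)", "X");
    }

    // (?x): whitespace and #-comments inside the pattern are ignored
    static boolean freeSpacing(String s) {
        return Pattern.matches("(?x) a b c  # three letters", s);
    }

    // \1 as a backreference: the captured text must occur twice in a row
    static boolean doubled(String s) {
        return s.matches("(.+)\\1");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(afterFirstChar("abc")));  // [a, bc]
        System.out.println(Arrays.toString(pairs("abcdefg")));       // [ab, cd, ef, g]
        System.out.println(Arrays.toString(beforeDigits("a1b2")));   // [a, 1b, 2]
        System.out.println(fooNotBar("foobar foobaz"));              // foobar Xbaz
        System.out.println(freeSpacing("abc"));                      // true
        System.out.println(doubled("abab"));                         // true
    }
}
```

The first method is the same base case, (?<=^.), that the solution uses as its first alternative.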

Explanation

The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:

(?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
- This matches at index 1, and is a crucial starting point to the rest of the process
Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.

Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java had no named groups when this was first written (they arrived in Java 7), and the pattern uses numbered groups throughout, so here are the semantics for the 3 capturing groups:

\1 captures the string "before" \G
\2 captures some string "after" \G
If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
- Hence \2 . has length k+1 and should be the next part in our split!
\3 captures the string to the right of our current position
- Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G

You can use mathematical induction to rigorously prove the correctness of this algorithm.

To help illustrate how this works, let's work through an example. Let's take abcdefghijklmn as input, and say that we've already partially split off [a, bc, def].

          \G     we now need to match here!
           ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 \G \2 . \3
  L=1+2+3    L=3

Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occurred at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.

Thus in our example, since \G is where we just split off def, we measure k=3, and the next match will split off ghij as expected.

To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:

+NBefore, i.e. we extend \1 by one forEachDotBehind
+1After, i.e. we extend \2 by just one

Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
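
To make the bookkeeping concrete, the same loop can be mirrored in ordinary Java. This is only an imperative analogue under the assumptions stated in the comments, not a description of what the regex engine does internally; TriangularMeasure, nextSplit and split are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class TriangularMeasure {

    // Given g, the index where the previous split was made (assumed to be a
    // triangular number 1+2+...+k with k >= 1), returns the next split index.
    static int nextSplit(int g) {
        int before = 0; // length of the string held by \1
        int after = 0;  // length of the string held by \2
        int dots = 0;   // dots consumed so far by the loop
        while (before < g) {  // notGyet: \1 has not reached \G
            dots++;           // the loop body consumes one more dot
            before += dots;   // +NBefore: extend \1 once per dot behind
            after += 1;       // +1After: extend \2 by one
        }
        // for a triangular g: before == g and after == k, so the next piece has length k + 1
        return g + after + 1;
    }

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        if (text.isEmpty()) {
            return parts;
        }
        int prev = 0;
        int next = 1; // base case: the first split is at index 1
        while (next < text.length()) {
            parts.add(text.substring(prev, next));
            prev = next;
            next = nextSplit(prev); // measured from scratch every time, like the regex
        }
        parts.add(text.substring(prev)); // the remaining characters
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(nextSplit(6));            // 10
        System.out.println(split("abcdefghijklmn")); // [a, bc, def, ghij, klmn]
    }
}
```

Note that nextSplit keeps no state between calls: it recounts from zero each time, just as the pattern has to rebuild \1 and \2 on every match attempt.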

Conclusion

Needless to say, this particular solution has terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).

There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not designed for this kind of pattern matching.

That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation have been instructive.

score 5 · Answer 2 · answered Sep 10 '10 at 06:40

The purpose of regex is to recognize patterns. Here you aren't searching for a pattern but splitting by length, so regex is not appropriate.

It is probably possible, but not with a single regex: to find the first n characters using a regex, you can use "^(.{n}).*"

So, you can search with that regex for the 1st character. Then, you make a substring, and you search for the 2 next characters. Etc.
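
A minimal sketch of that substring-and-match loop might look like the following. The class and method names are hypothetical, and the pattern uses {1,n} instead of {n}, which is an adjustment made here so that the final, shorter piece still matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IterativeRegexSplit {

    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        String rest = s;
        for (int n = 1; !rest.isEmpty(); n++) {
            // group 1 greedily takes up to n characters; .* swallows whatever is left
            Pattern p = Pattern.compile("^(.{1," + n + "}).*", Pattern.DOTALL);
            Matcher m = p.matcher(rest);
            if (!m.matches()) {
                break;
            }
            parts.add(m.group(1));
            // continue with the substring that follows the piece just taken
            rest = rest.substring(m.end(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("hellohowareyou??")); // [h, el, loh, owar, eyou?, ?]
    }
}
```

One pattern is compiled per piece, so this is a loop around several regexes rather than a single one.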

Like @splash said, it will make the code more complicated and inefficient, since you are using regex for something outside its purpose.

Benoit Courtine


the word "patterns" can be interpreted very broadly. My solution is a single regex that does the job, so by that definition, there is a well-defined splitting delimiter "pattern". That said, yes, it's definitely an overkill to use regex to solve this task, but boy, can you learn so much from it. – polygenelubricants Sep 11 '10 at 14:11

score 0 · Answer 3 · answered Sep 10 '10 at 08:47

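import java.util.regex.Pattern;

class SuffixSplit {
    public static void main(String[] args) {
        String a = "hellohowareyou??";
        int i = 1; // length of the next piece to split off

        while (true) {
            if (i >= a.length()) {
                // at most i characters remain: print them as the last piece
                System.out.println(a);
                break;
            }
            else {
                // b is everything that follows the next piece
                String b = a.substring(i++);
                // splitting on the quoted suffix, anchored at the end, leaves only the piece
                String[] out = a.split(Pattern.quote(b) + "$");
                System.out.println(out[0]);
                a = b;
                if (b.isEmpty())
                    break;
            }
        }
    }
}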
