3

I've been wrestling with an issue I was hoping to solve with regex.

Let's say I have a string of alphanumeric characters in which some substrings may be surrounded by square brackets. These bracketed substrings can appear anywhere in the string, and there can be any number of them.

Examples:

  • aaa[bb b]
  • aaa[bbb]ccc[d dd]
  • [aaa]bbb[c cc]

You can see that some of the bracketed substrings contain whitespace, and that's fine. My main issue right now is with spaces outside of the brackets, like this:

  • a aa[bb b]

Now I want to preserve the spaces inside the brackets but remove them everywhere else.

This gets a little more tricky for strings like:

  • a aa[bb b]c cc[d dd]e ee[f ff]

Here I would want the return to be:

  • aaa[bb b]ccc[d dd]eee[f ff]

I have spent some time reading through different regex pages about lookarounds, negative assertions, etc., and it's making my head spin.

NOTE: for anyone visiting this, I was not looking for a solution involving nested brackets. If that were the case, I'd probably do it programmatically, as some of the comments below suggest.

seano
6 Answers

11

This regex should do the trick:

[ ](?=[^\]]*?(?:\[|$))

Just replace the space that was matched with "".

Basically, the lookahead checks that, scanning forward from the space, the next bracket character is an opening "[" (or the string ends) rather than a closing "]". A space inside brackets always reaches a "]" first, so it is left alone.

That should work as long as you don't have nested square brackets, e.g.:

a a[b [c c]b]

Because in that case, the space after the first "b" will be removed and it will become:

aa[b[c c]b]
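
As a quick check, the substitution can be sketched in Python with the standard `re` module. The choice of Python and the function name `strip_outer_spaces` are assumptions made for illustration; the pattern itself is the one given above.

```python
import re

# The lookahead pattern from this answer, unchanged.
OUTSIDE_SPACE = re.compile(r"[ ](?=[^\]]*?(?:\[|$))")


def strip_outer_spaces(text: str) -> str:
    """Remove spaces that are not inside a [...] group (assumes no nesting)."""
    return OUTSIDE_SPACE.sub("", text)


print(strip_outer_spaces("a aa[bb b]c cc[d dd]e ee[f ff]"))  # aaa[bb b]ccc[d dd]eee[f ff]
print(strip_outer_spaces("a aa[bb b]g gg"))                  # aaa[bb b]ggg
# Nested brackets: the space after the first "b" is lost, as described above.
print(strip_outer_spaces("a a[b [c c]b]"))                   # aa[b[c c]b]
```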

Senseful
  • Awesome, thank you. I was somewhat close, but I couldn't handle past 2 sets of bracketed substrings. And I did not need nested brackets (phew!). – seano Jul 31 '09 at 13:48
  • The '|$' at the end is required in case your string is something like 'a aa[bb b]c cc[d dd]e ee[f ff]g gg', to get rid of the space between the g's. They don't have a '[' following them, so you also want to check for end of string ('$'). You are correct that the '[' inside the first character class is not required. That is because '.*?b' is essentially the same as '[^b]*b' as long as that's the end of the regex. This was just left over from while I was writing it in the first place before I used the '?' character. It's interesting to note however that '.+?b' is not the same as '[^b]+b'. – Senseful Jul 31 '09 at 15:29
8

This doesn't sound like something you really want regex for. It's very easy to parse directly by reading through. Pseudo-code:

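inside_brackets = false;
result = "";
for (i = 0; i < length(str); i++) {
    if (str[i] == '[')
        inside_brackets = true;
    else if (str[i] == ']')
        inside_brackets = false;
    // copy everything except spaces that fall outside brackets;
    // building a new string avoids skipping characters after a delete
    if (inside_brackets || !is_space(str[i]))
        result += str[i];
}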

Anything involving regex is going to involve a lot of lookbehind stuff, which will be repeated over and over, and it'll be much slower and less comprehensible.

To make this work for nested brackets, simply change inside_brackets to a counter, starting at zero, incrementing on open brackets, and decrementing on close brackets.
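
The counter variant can be sketched in Python as follows. This is only an illustration: the function name `strip_spaces_outside_brackets` is made up here, and ignoring a stray "]" that has no matching "[" is an assumption, since the answer does not say what should happen in that case.

```python
def strip_spaces_outside_brackets(text: str) -> str:
    """Drop whitespace at bracket depth 0; keep everything else."""
    depth = 0   # number of currently open "[" brackets
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            # Assumption: an unmatched "]" does not push the depth below zero.
            depth -= 1
        # Brackets themselves are never whitespace, so they are always kept.
        if depth > 0 or not ch.isspace():
            kept.append(ch)
    return "".join(kept)


print(strip_spaces_outside_brackets("a aa[bb b]c cc[d dd]e ee[f ff]"))  # aaa[bb b]ccc[d dd]eee[f ff]
print(strip_spaces_outside_brackets("a a[b [c c]b] d"))                 # aa[b [c c]b]d
```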

Cascabel
  • Heh, good thing I checked for new answers before posting mine. That's almost exactly what I had, except my pseudocode didn't look as much like PHP. – Michael Myers Jul 30 '09 at 21:26
  • Actually it shouldn't involve any look behind if there is no nesting, and your code also assumes no nesting. – Senseful Jul 30 '09 at 21:27
  • Depending on the language, this may need to be expanded to handle nested brackets. But this is probably the best approach. – derobert Jul 30 '09 at 21:28
  • eagle, I was a bit imprecise (read as "incorrect"). What I was thinking of was the fact that, nested or not, for every bracket you have to find the matching close. You're right, you're really looking for repetitions of the pattern /\[[^\]]*\]/. – Cascabel Jul 30 '09 at 21:32
2

This works for me:

(\[.+?\])|\s

Then you simply pass in a replacement value of $1 when you call the replace function. The idea is to match each bracketed substring first and put it back unchanged via $1. Every other whitespace character, which must lie outside the brackets, leaves group 1 empty and so gets replaced with nothing.

Note that I tested this with Regex Hero (a .NET regex tester), and not in PHP. So I'm not 100% sure this will work for you.
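
For comparison, here is a minimal Python sketch of the same idea. It is a port, not the .NET call tested above: the function name `strip_with_capture` is hypothetical, and a replacement callback is used in place of `$1` so that an unmatched group is explicitly turned into an empty string.

```python
import re

# Group 1 captures a whole bracketed substring; the alternative matches one
# whitespace character anywhere else.
BRACKETS_OR_SPACE = re.compile(r"(\[.+?\])|\s")


def strip_with_capture(text: str) -> str:
    """Put bracketed substrings back unchanged and drop all other whitespace."""
    # group(1) is None when the \s branch matched, so fall back to "".
    return BRACKETS_OR_SPACE.sub(lambda m: m.group(1) or "", text)


print(strip_with_capture("a aa[bb b]c cc[d dd]e ee[f ff]"))  # aaa[bb b]ccc[d dd]eee[f ff]
```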

That was an interesting one. Sounded simple at first, then seemed rather difficult. And then the solution I finally arrived at was indeed simple. I was surprised the solution didn't require a lookaround of any sort. And it should be faster than any method that uses a lookaround.

Steve Wortham
  • Upvoting for rare use of a beautiful, simple but efficient technique! (Technique is detailed on [current bounty question](http://stackoverflow.com/questions/23589174/match-a-pattern-except-in-three-situations-s1-s2-s3/23589204#23589204) ) Btw Steve in PHP you can also use (*SKIP)(*F) which is the same idea but even faster, added quick answer to that effect. :) – zx81 May 19 '14 at 22:45
1

How to do this depends on what should be done with:

a b [ c [ d [ e ] f ] g

That is ambiguous; possible answers are at least:

  • ab[ c [ d [ e ] f ]g
  • ab[ c [ d [ e ]f]g
  • error out; the brackets don't match!

For the first two cases, you can use regexps. For the third case, you'd be much better off with a (small) parser.

For either case one or two, split the string on the first [. Strip spaces from everything before [ (that's obviously outside of the brackets). Next, look for .*\] (case 1) or .*?\] (case 2) and move that over to your output. Repeat until you're out of input.
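
The split-and-copy loop for cases 1 and 2 might look like this in Python. It is a sketch under assumptions: the function name `strip_by_splitting` and the `greedy` flag are invented for illustration, and copying the remainder verbatim when a "[" is never closed is a choice made here, not something the steps above specify.

```python
import re


def strip_by_splitting(text: str, greedy: bool) -> str:
    """greedy=True follows case 1 (.*\\]); greedy=False follows case 2 (.*?\\])."""
    closing = re.compile(r".*\]" if greedy else r".*?\]", re.DOTALL)
    out = []
    rest = text
    while rest:
        # Everything before the first "[" is outside brackets: strip its spaces.
        head, bracket, tail = rest.partition("[")
        out.append(re.sub(r"\s+", "", head))
        if not bracket:
            break
        match = closing.match(tail)
        if match is None:
            # Assumption: an unclosed "[" keeps the remainder untouched.
            out.append(bracket + tail)
            break
        # Copy the bracketed part to the output unchanged, then continue.
        out.append(bracket + match.group(0))
        rest = tail[match.end():]
    return "".join(out)


sample = "a b [ c [ d [ e ] f ] g"
print(strip_by_splitting(sample, greedy=True))   # ab[ c [ d [ e ] f ]g
print(strip_by_splitting(sample, greedy=False))  # ab[ c [ d [ e ]f]g
```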

derobert
1

Resurrecting this question because it had a simple solution that wasn't mentioned.

\[[^]]*\](*SKIP)(*F)|\s+

The left side of the alternation matches a complete bracketed substring and then deliberately fails, with (*SKIP) preventing the engine from retrying inside it. The right side matches runs of whitespace (there is no capture group), and we know they are the right ones because any whitespace within brackets has already been skipped by the expression on the left.

See the matches in this demo

This means you can just do

$replace = preg_replace("~\[[^]]*\](*SKIP)(*F)|\s+~","",$string);

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
zx81
0

The following will match start-of-line or end-of-bracket (which must come before any space you want to match) followed by anything that isn't start-of-bracket or a space, followed by some space.

/((^|\])[^ \[]*) +/

Replacing every match with $1 (a global replace) removes the first block of spaces from each non-bracketed sequence. You will have to repeat the replacement until nothing matches to remove all spaces.

Example:

abcd efg [hij klm]nop qrst u
abcdefg [hij klm]nopqrst u
abcdefg[hij klm]nopqrstu
done
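
The repeat-until-done loop can be sketched in Python like this. The function name `strip_iteratively` is hypothetical; the pattern is the one above without the surrounding slashes.

```python
import re

# Start of string or "]", then non-space/non-"[" characters, then spaces.
STEP = re.compile(r"((^|\])[^ \[]*) +")


def strip_iteratively(text: str) -> str:
    """Apply the global replace repeatedly until a pass changes nothing."""
    while True:
        # Each pass removes the first block of spaces of every
        # non-bracketed sequence; subn also reports how many were replaced.
        text, count = STEP.subn(r"\1", text)
        if count == 0:
            return text


print(strip_iteratively("abcd efg [hij klm]nop qrst u"))  # abcdefg[hij klm]nopqrstu
```

Every replacement deletes at least one space, so the loop always terminates.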
Draemon