1

I would like to insert spaces between the characters of a word, but only for words with at least 2 uppercase characters. I can use regex.

For example: "This is simple SEnTeNCE with a FEW word." -> "This is simple S E n T e N C E with a F E W word."

Kara

2 Answers

4

A way with PHP/PCRE:

$pattern = '~(?:\b(?=(?:\w*[A-Z]){2})|(?!^)\G)\w\B\K~';

$text = preg_replace($pattern, ' ', $text);

pattern details:

(?:                      # non-capturing group, starting with either:
    \b                   # a word boundary
    (?=(?:\w*[A-Z]){2})  # followed by a word with at least two uppercase letters
  |                      # OR
    (?!^)\G              # anchor: end of the previous match (but not the start of the string)
)
\w\B                     # a word character followed by another word character
\K                       # discard everything matched so far from the match result

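A way with JavaScript, using a callback:

var str = "This is simple SEnTeNCE with a FEW word.";

// Match whole words containing at least two uppercase letters,
// then rebuild each one with a space between its characters.
var res = str.replace(/\b(?:[a-z]*[A-Z]){2,}[a-z]*\b/g, function (m) {
    return m.split('').join(' ');
});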

console.log(res);
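
JavaScript regexes have no `\G` or `\K`, so the PCRE pattern above cannot be ported as-is. As a rough sketch only, assuming an engine with ES2018 lookbehind support (a recent Node.js, for instance), a zero-width match can play a similar role without a callback. The function name `spaceOutWords` is made up for illustration.

```javascript
// Hypothetical helper; assumes ES2018 lookbehind support.
function spaceOutWords(text) {
  // Match the empty position between two word characters, provided the
  // word they belong to contains at least two uppercase letters.
  //   (?<=\b(?=(?:\w*[A-Z]){2})\w+)  word characters reach back to the word
  //                                  start, where the uppercase check runs
  //   (?=\w)                         another word character follows
  var gap = /(?<=\b(?=(?:\w*[A-Z]){2})\w+)(?=\w)/g;
  return text.replace(gap, ' ');
}

console.log(spaceOutWords("This is simple SEnTeNCE with a FEW word."));
```

Because the match is empty, nothing has to be captured and put back: the replacement string is just the space.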
Casimir et Hippolyte
  • Can you elaborate on how you're using `(?!^)\G` and `\K`? I understand the negative lookahead for start-of-input, but I've not used match anchors or resetters. – aliteralmind Apr 25 '14 at 16:06
  • @aliteralmind: `\G` matches the position in the string where the last match ended. Before the first match that position isn't defined, so `\G` acts as an anchor for the start of the string. To forbid a match at the start of the string you can simply add `(?!^)` or `(?!\A)` or `(? – Casimir et Hippolyte Apr 25 '14 at 16:41
  • @aliteralmind: about the `\K` feature. `\K` doesn't change what is matched; it only removes everything matched to its left from the result. As proof, you cannot obtain overlapping results. Example: with the string "abcd", the pattern `abc\Kd|abc` gives only "d"; the second branch of the alternation never produces a result, since "abc" has already been consumed by the first branch. – Casimir et Hippolyte Apr 25 '14 at 16:43
  • @aliteralmind: Using `\K` here is a convenience: it avoids putting everything on its left in a capture group and referencing that group in the replacement string. Another use of `\K`: it can help when you run into the variable-length lookbehind limitation. – Casimir et Hippolyte Apr 25 '14 at 16:58
  • @aliteralmind: To understand the role of `\G` in a global search: The schema of the pattern is `(?: entry-point | \G – Casimir et Hippolyte Apr 25 '14 at 17:06
  • @aliteralmind: Note that if I follow strictly the schema for the current pattern, I must write `\K\B` instead of `\B\K`, where `\B` is `the-condition-to-break-the-contiguity`. But it doesn't matter here. – Casimir et Hippolyte Apr 25 '14 at 17:15
  • This is great information Casimir. I think it would be a nice addition to the FAQ. There's a good entry on [`\G`](http://stackoverflow.com/questions/21971701/when-is-g-useful-application-in-a-regex), but the `\K` one could be improved, and there isn't any on using them together. If you have the time, put this information into your answer and consider adding a "walkthrough" of this specific example. Is that okay? – aliteralmind Apr 25 '14 at 18:00
  • @aliteralmind: Why not? I can write something like a "how to". – Casimir et Hippolyte Apr 25 '14 at 18:03
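
JavaScript has no `\K`, which makes it a convenient place to see what `\K` saves in the PCRE pattern. A minimal sketch, assuming the input is a single word (the `\G` entry-point logic is deliberately left out): the left-hand side is captured and re-emitted through `$1`, which is the step `\w\B\K` avoids. The name `spaceOut` is hypothetical.

```javascript
// Hypothetical helper: capture-group emulation of PCRE's \w\B\K.
function spaceOut(word) {
  // (\w)\B matches a word character that is followed by another word
  // character; "$1 " puts the captured character back and appends a space.
  return word.replace(/(\w)\B/g, '$1 ');
}

console.log(spaceOut("FEW")); // F E W
```

The last character of the word is never matched, because a word boundary (not `\B`) follows it, so no trailing space is added.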
1

A single-regex solution would be (PCRE):

(?|(?=\b(?:[a-z]*[A-Z]){2})(\w)|(?!^)\G(\w))(?!\b)

(?|                             # branch reset group
  (?= \b (?:[a-z]* [A-Z]){2} )  # lookahead anchored at the beginning of the word:
                                # check we are at the beginning of a two-upper word
  (\w)                          # grab the first letter
|                               # OR
  (?!^)\G                       # we're following a previous match (and not
                                # at the beginning of the string)
  (\w)                          # if so we're inside a wanted word, so we grab
                                # a character
)
(?!\b)                          # except if it's the last one (we don't want
                                # too many spaces)

And replace with

\1 # <- there's a space after the \1


Note that it might be easier to do it in several steps (grabbing the words, treating them individually, then joining everything back together)...
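
The multi-step route could look something like the following JavaScript sketch. It is only one possible reading of "grab, treat, join": it assumes words are runs of `\w` characters, and `spaceOutStepwise` is a made-up name.

```javascript
// Hypothetical helper illustrating the multi-step approach.
function spaceOutStepwise(text) {
  return text
    // 1. Grab the words; the capture group keeps the separators in the array.
    .split(/(\W+)/)
    // 2. Treat each piece individually: space it out only if it contains
    //    at least two uppercase letters (separators never do).
    .map(function (piece) {
      var uppercase = piece.match(/[A-Z]/g) || [];
      return uppercase.length >= 2 ? piece.split('').join(' ') : piece;
    })
    // 3. Join everything back together.
    .join('');
}

console.log(spaceOutStepwise("This is simple SEnTeNCE with a FEW word."));
```

This trades the compact pattern for code that is easier to read and to adjust, for example if the threshold of two uppercase letters changes.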

Robin