Match all substrings that end with 4 digits using regular expressions

Question

I am trying to split a string in PHP, which looks like this:

ABCDE1234ABCD1234ABCDEF1234

Into an array of strings which, in this case, would look like this:

ABCDE1234
ABCD1234
ABCDEF1234

So the pattern is "an undefined number of letters, and then 4 digits, then an undefined number of letters and 4 digits etc."

I'm trying to split the string using preg_split like this:

$pattern = "#[0-9]{4}$#";
preg_split($pattern, $stringToSplit);

And it returns an array containing the full string (not split) in the first element.

I'm guessing the problem here is my regex, as I don't fully understand how to use regular expressions and I am not sure if I'm using this one correctly.

So what would be the correct regex to use?

Are you sure though you can't just split the string after a number is followed by a letter? From your example it seems like you can totally do that. — user1306322, Nov 03 '16 at 15:29
Why can't you just simply find every place where you have a digit-letter pair, that would give you the positions to break the string — Brad Thomas, Nov 03 '16 at 20:51
Could have done that, I didn't realize it... well, it works this way! — DevBob, Nov 04 '16 at 09:17

score 16 · Accepted Answer · answered Nov 03 '16 at 13:51

You don't want preg_split, you want preg_match_all:

$str = 'ABCDE1234ABCD1234ABCDEF1234';
preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);
var_dump($matches);

Output:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(9) "ABCDE1234"
    [1]=>
    string(8) "ABCD1234"
    [2]=>
    string(10) "ABCDEF1234"
  }
}
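
For readers who want the missing explanation, here is a minimal sketch of how the pattern reads and how to get a flat array out of the result. The variable names `$count` and `$parts` are illustrative and not part of the original answer.

```php
<?php
$str = 'ABCDE1234ABCD1234ABCDEF1234';

// [a-z]+    one or more letters (the i flag makes this case-insensitive)
// [0-9]{4}  exactly four digits
// preg_match_all() returns the number of full matches it found.
$count = preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);

// $matches is nested: index 0 holds the full-pattern matches,
// so taking it gives the flat list the question asks for.
$parts = $matches[0];

print_r($parts);
```

Because the pattern describes the pieces to keep rather than a delimiter, nothing has to be "consumed" between the pieces, which is why matching is a natural fit here.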

answered Nov 03 '16 at 13:51

mister martin

This answer is missing its explanation. – mickmackusa Mar 19 '20 at 21:33

asontu · Answer 2 · 2016-11-04T09:25:33.943

8

PHP uses PCRE-style regexes which let you do lookbehinds. You can use this to see if there are 4 digits "behind" you. Combine that with a lookahead to see if there's a letter ahead of you, and you get this:

(?<=\d{4})(?=[a-z])

Notice the dotted lines on the Debuggex Demo page. Those are the points you want to split on.

In PHP this would be:

var_dump(preg_split('/(?<=\d{4})(?=[a-z])/i', 'ABCDE1234ABCD1234ABCDEF1234'));

edited Nov 04 '16 at 09:25

answered Nov 03 '16 at 13:50

asontu

In the "PHP" regex you don't seem to need to explicitly check for 4 digits, you could just check for a digit followed by letter? ie. `/(?<=\d)(?=[a-z])/i` _(+1)_ – MrWhite Nov 03 '16 at 17:09
@w3dk in this case yes, I just like my regex explicit when possible :) – asontu Nov 04 '16 at 09:25
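
To make the trade-off in these two comments concrete, here is a small sketch comparing the explicit and the shortened lookbehind. The second input string is a made-up example (not from the question) that contains a run of fewer than four digits.

```php
<?php
$explicitPattern = '/(?<=\d{4})(?=[a-z])/i'; // four digits behind, a letter ahead
$shortPattern    = '/(?<=\d)(?=[a-z])/i';    // any digit behind, a letter ahead

// On the question's input both patterns split at the same positions.
$input = 'ABCDE1234ABCD1234ABCDEF1234';
$explicit = preg_split($explicitPattern, $input);
$short    = preg_split($shortPattern, $input);
var_dump($explicit === $short);

// Hypothetical input with only two digits before the next letters:
// the explicit pattern leaves it whole, the short one splits after "12".
$odd = 'AB12CD3456';
$oddExplicit = preg_split($explicitPattern, $odd);
$oddShort    = preg_split($shortPattern, $odd);
print_r($oddExplicit);
print_r($oddShort);
```

Which behaviour is preferable depends on whether malformed groups should be kept together or split apart.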

score 6 · Answer 3 · answered Nov 03 '16 at 13:51

Use the principle of contrast:

\D+\d{4}
# requires at least one non digit
# followed by exactly four digits

See a demo on regex101.com.

In PHP this would be:

<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';
$regex = '~\D+\d{4}~';
preg_match_all($regex, $string, $matches);
?>

See a demo on ideone.com.
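
One consequence of the contrast approach is worth spelling out: `\D` matches any non-digit, not only letters. The sketch below uses a made-up input containing a hyphen (an assumption, not part of the question) to show how that differs from a letter-only character class.

```php
<?php
// Hypothetical input with a non-letter, non-digit character in it.
$string = 'AB-CD1234EF5678';

// \D+ accepts the hyphen as part of the leading run.
preg_match_all('~\D+\d{4}~', $string, $contrast);

// [a-z]+ accepts letters only, so the match can only start after the hyphen.
preg_match_all('~[a-z]+\d{4}~i', $string, $letters);

print_r($contrast[0]);
print_r($letters[0]);
```

For the question's input, which contains only letters and digits, both patterns return the same three elements.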

MonkeyZeus · Answer 4 · 2016-11-03T15:18:13.883

2

I'm no good at regex so here is the road less traveled:

<?php
$s = 'ABCDE1234ABCD1234ABCDEF1234';
$nums = range(0,9);

$num_hit = 0;
$i = 0;
$arr = array();

foreach(str_split($s) as $v)
{
    if(isset($nums[$v]))
    {
        ++$num_hit;
    }

    if(!isset($arr[$i]))
    {
        $arr[$i] = '';
    }

    $arr[$i].= $v;

    if($num_hit === 4)
    {
        ++$i;
        $num_hit = 0;
    }
}

print_r($arr);

edited Nov 03 '16 at 15:18

answered Nov 03 '16 at 13:58

MonkeyZeus

This answer is missing its explanation. The `$nums` lookup array can be avoided with the implementation of `ctype_digit()` in your conditional expression instead of `isset()`. – mickmackusa Mar 19 '20 at 21:32
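
The `ctype_digit()` suggestion from this comment could look roughly like the sketch below. The function name `splitAfterFourDigits` is hypothetical. Like the loop in the answer, it counts digits since the last cut, whether or not they are adjacent, and closes a piece after the fourth one.

```php
<?php
// Hypothetical helper: walk the string once and cut after every fourth digit.
function splitAfterFourDigits(string $s): array
{
    $parts = [];
    $current = '';
    $digits = 0;

    foreach (str_split($s) as $char) {
        $current .= $char;

        // ctype_digit() replaces the range(0, 9) lookup array.
        if (ctype_digit($char)) {
            ++$digits;
        }

        if ($digits === 4) {
            $parts[] = $current; // close the current piece
            $current = '';
            $digits = 0;
        }
    }

    // Keep any trailing characters that never reached four digits.
    if ($current !== '') {
        $parts[] = $current;
    }

    return $parts;
}

print_r(splitAfterFourDigits('ABCDE1234ABCD1234ABCDEF1234'));
```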

mickmackusa · Answer 5 · 2020-03-22T12:17:47.953

First, why is your attempted pattern not delivering the desired output? Because the $ anchor restricts the match to the end of the string, so the function explodes the string using only the final four digits as the "delimiter" (characters that are consumed while dividing the string into separate parts).

Your result:

array (
  0 => 'ABCDE1234ABCD1234ABCDEF', // an element of characters before the last four digits
  1 => '',  // an empty element containing the non-existent characters after the four digits
)

In plain English, to fix your pattern, you must:

Not consume any characters while exploding and
Ensure that no empty elements are generated.

My snippet is at the bottom of this post.

Second, there seems to be some debate about what regex function to use (or even if regex is a preferable tool).

My stance is that using a non-regex method will require a long-winded block of lines which will be equally if not more difficult to read than a regex pattern. Using regex lets you generate your result in one line and not in an unsightly fashion. So let's dispose of iterated sets of conditions for this task.
Now the critical concern is whether this task is simply "extracting" data from a consistent and valid string (case "A"), or if it is "validating AND extracting" data from a string (case "B") because the input cannot be 100% trusted to be consistent/correct.
- In case A, you needn't concern yourself with producing valid elements in the output, so preg_split() or preg_match_all() are good candidates.
- In case B, preg_split() would not be advisable, because it only hunts for delimiting substrings -- it remains ignorant of all other characters in the string.
Assuming this task is case A, then a decision is still pending about the better function to call. Well, both functions generate an array, but preg_match_all() creates a multidimensional array while you desire a flat array (like preg_split() provides). This means you would need to add a new variable to the global scope ($matches) and append [0] to the array to access the desired fullstring matches. To someone who doesn't understand regex patterns, this may border on the bad practice of using "magic numbers".
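
For completeness, case B could be handled along the lines of the sketch below: validate the whole string first, then extract. This is only one possible approach under the assumption that every group must be letters followed by exactly four digits. The function name `extractCodes` is hypothetical, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper for case B: returns the pieces, or null if the
// input is not made up entirely of "letters + four digits" groups.
function extractCodes(string $input): ?array
{
    // \A and \z anchor the check to the very start and end of the string.
    if (!preg_match('/\A(?:[a-z]+\d{4})+\z/i', $input)) {
        return null;
    }

    // The input is known to be well-formed, so plain extraction is safe.
    preg_match_all('/[a-z]+\d{4}/i', $input, $matches);

    return $matches[0];
}

var_export(extractCodes('ABCDE1234ABCD1234ABCDEF1234'));
var_export(extractCodes('ABCDE123ABCD1234')); // first group has only 3 digits
```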

For me, I strive to code for Directness and Accuracy, then Efficiency, then Brevity and Clarity. Since you're not likely to notice any performance drops while performing such a small operation, efficiency isn't terribly important. I just want to make some comparisons to highlight the cost of a pattern that leverages only look-arounds or a pattern that misses an opportunity to greedily match predictable characters.

/(?<=\d{4})(?=[a-z])/i 79 steps (Demo)
~\d{4}\K~ 25 steps (Demo)
/[a-z]+[0-9]{4}\K/i 13 steps (Demo)
~\D+[0-9]{4}\K~ 13 steps (Demo)
~\D+\d{4}\K~ 13 steps (Demo)

FYI, \K is a metacharacter that means "restart the fullstring match", in other words "forget/release all previously matched characters up to this point". This effectively ensures that no characters are lost while splitting.
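
The effect of \K is easiest to see outside of splitting. The following minimal sketch (the variable names are illustrative) runs the same kind of pattern with and without \K through preg_match():

```php
<?php
// Without \K the reported match is everything the pattern consumed.
preg_match('/[a-z]+\d{4}/i', 'ABCDE1234', $without);

// With \K the letters must still be present, but they are dropped
// from the reported match, which now starts at the digits.
preg_match('/[a-z]+\K\d{4}/i', 'ABCDE1234', $with);

var_dump($without[0]); // string(9) "ABCDE1234"
var_dump($with[0]);    // string(4) "1234"
```

When \K sits at the very end of a pattern, the reported match is empty, so preg_split() has a position to split at but no characters to throw away.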

Suggested technique: (Demo)

var_export(
    preg_split(
        '~\D+\d{4}\K~',                // pattern
        'ABCDE1234ABCD1234ABCDEF1234', // input
        0,                             // make unlimited explosions
        PREG_SPLIT_NO_EMPTY            // exclude empty elements
    )
);

Output:

array (
  0 => 'ABCDE1234',
  1 => 'ABCD1234',
  2 => 'ABCDEF1234',
)
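
As a cross-check, the five patterns from the step comparison can be run side by side. This sketch assumes the same input and the same PREG_SPLIT_NO_EMPTY flag as the suggested technique. It only compares the resulting arrays and does not measure step counts, which come from the linked demos.

```php
<?php
$input = 'ABCDE1234ABCD1234ABCDEF1234';

// The five patterns listed in the step comparison.
$patterns = [
    '/(?<=\d{4})(?=[a-z])/i',
    '~\d{4}\K~',
    '/[a-z]+[0-9]{4}\K/i',
    '~\D+[0-9]{4}\K~',
    '~\D+\d{4}\K~',
];

$results = [];
foreach ($patterns as $pattern) {
    // -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops empty elements.
    $results[$pattern] = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
}

// Report whether every pattern agrees with the first one.
$first = reset($results);
foreach ($results as $pattern => $parts) {
    echo $pattern, ' => ', ($parts === $first ? 'same' : 'different'), PHP_EOL;
}
```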
