10

Can anybody explain how to use the preg_split() function? I don't understand the pattern parameter, for example "/[\s,]+/".

For example:

I have this subject: is is. and I want the result to be:

array (
  0 => 'is',
  1 => 'is',
)

So it should ignore the space and the full stop. How can I do that?

Jared Farrish
MD.MD
  • 2
    What are the exact rules you're working with? Are you trying to get a list of words out of a string? – ceejayoz Jun 12 '14 at 16:43
  • @ceejayoz yes I want to store the words only into the array. – MD.MD Jun 12 '14 at 16:45
  • I don't know what your input can be, but exploding on space and trimming would probably be easier. – jeroen Jun 12 '14 at 16:48
  • @jeroen whatever the input is, the array should store the words only. – MD.MD Jun 12 '14 at 16:50
  • 1
    `preg_split()` is good if you want to chop up a string, and if you know exactly how you want to do it. `preg_match()` might be a good alternative, if you know what you want to get out of the string. In this case you want to extract words; `preg_match` might be a better choice. You should consider it. – Sverri M. Olsen Jun 12 '14 at 16:51
  • I often find it is simpler (from a pattern matching perspective) to pre-sanitize the string I am working with. For instance, do a global replace on it stripping out any irrelevant characters like punctuation etc. It may not be as efficient to execute, but it saves my brain dribbling out of my ears trying to craft a regexp that works otherwise. – Majenko Jun 12 '14 at 17:01
  • 1
    str_word_count is a better option than preg_split here. It can return an array of words. – ceejayoz Jun 12 '14 at 17:03

4 Answers

33

"preg" means "Pcre REGexp", which is kind of redundant, since "PCRE" already means "Perl Compatible Regular Expressions".

Regexps are a nightmare to the beginner. I still don’t fully understand them and I’ve been working with them for years.

Basically, the example you have there, broken down, is:

"/[\s,]+/"

/ = start or end of pattern string (the delimiter)
[ ... ] = character class: matches any one of the characters listed inside
+ = one or more of the preceding character or group
\s = any whitespace character (space, tab, newline, etc.)
, = the literal comma character

So you have a search pattern that says "split on any run of one or more whitespace characters and/or commas".
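As a minimal sketch of that pattern in action (the sample strings and variable names are made up purely for illustration):

```php
<?php
// Split on any run of whitespace and/or commas.
$fruits = preg_split('/[\s,]+/', "apple, banana   cherry,date");
print_r($fruits);   // apple, banana, cherry, date

// The subject from the question: the full stop is not in the character
// class, so it stays attached to the last word.
$words = preg_split('/[\s,]+/', 'is is.');
print_r($words);    // is, is.
```

So this pattern alone is not enough for the question's input; the full stop has to be handled as well.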

Other common characters are:

. = any single character (except a newline, by default)
* = zero or more of the preceding character or group
^ (at start of pattern) = the start of the string
$ (at end of pattern) = the end of the string
^ (at start of [...]) = "NOT" any of the characters that follow
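A short sketch of a negated class and of the two anchors, under the assumption that a "word" means a run of ASCII letters (that definition, the patterns, and the variable names are illustrative choices, not something the question specifies):

```php
<?php
// [^A-Za-z]+ matches a run of one or more characters that are NOT ASCII letters.
// PREG_SPLIT_NO_EMPTY drops the empty piece left after the trailing full stop.
$words = preg_split('/[^A-Za-z]+/', 'is is.', -1, PREG_SPLIT_NO_EMPTY);
print_r($words);   // is, is

// Outside a class, ^ and $ anchor the match to the start and end of the string:
// "starts with 'is' and ends with a literal full stop".
$matched = preg_match('/^is.*\.$/', 'is is.');
var_dump($matched);   // int(1)
```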

For PHP there is good information in the official documentation.

Majenko
7

This should work:

$words = preg_split("/(?<=\w)\b\s*[!?.]*/", 'is is.', -1, PREG_SPLIT_NO_EMPTY);

echo '<pre>';
print_r($words);
echo '</pre>';

The output would be:

Array
(
    [0] => is
    [1] => is
)

Before I explain the regex, a quick word on PREG_SPLIT_NO_EMPTY. That flag tells preg_split to return only the non-empty pieces. This ensures that the array $words contains real data rather than empty strings, which can easily appear when the pattern matches at the very start or end of the subject, or when two matches sit next to each other.
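To see what the flag changes, here is a small sketch that uses a deliberately simpler pattern than the one above ("/[\s.]+/" is an assumption chosen for illustration, as are the variable names):

```php
<?php
// Without the flag, the trailing full stop leaves an empty string at the end.
$withEmpty = preg_split('/[\s.]+/', 'is is.');
var_dump($withEmpty);      // 'is', 'is', ''

// With the flag, only the non-empty pieces are returned.
// The third argument (-1) means "no limit on the number of pieces".
$withoutEmpty = preg_split('/[\s.]+/', 'is is.', -1, PREG_SPLIT_NO_EMPTY);
var_dump($withoutEmpty);   // 'is', 'is'
```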

And the explanation of that regex can be broken down like this using this tool:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [!?.]*                   any character of: '!', '?', '.' (0 or more
                           times (matching the most amount possible))

A nicer explanation can be found by entering the full regex pattern /(?<=\w)\b\s*[!?.]*/ in this other tool:

  • (?<=\w) Positive Lookbehind - Assert that the regex below can be matched
  • \w match any word character [a-zA-Z0-9_]
  • \b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
  • \s* match any white space character [\r\n\t\f ]
  • Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
  • [!?.]* match a single character present in the list !?. literally, zero or more times [greedy]

That last regex explanation can be boiled down by a human—also known as me—as the following:

Split at every word boundary that directly follows a word character, and swallow any whitespace and any of the punctuation marks !, ? and . that come right after it.
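As a quick check on a longer input (the sample sentence and variable names are made up for illustration), the pattern appears to leave whitespace that follows punctuation attached to the next word, because \s* comes before [!?.]* in the pattern. Trimming each piece is one possible workaround:

```php
<?php
$sentence = 'Hello world! How are you?';

$pieces = preg_split("/(?<=\w)\b\s*[!?.]*/", $sentence, -1, PREG_SPLIT_NO_EMPTY);
print_r($pieces);    // Hello, world, " How" (note the leading space), are, you

// One way to tidy up: trim whitespace from every piece.
$trimmed = array_map('trim', $pieces);
print_r($trimmed);   // Hello, world, How, are, you
```

Other punctuation, such as commas, is not covered by this pattern at all, so it would need to be added to the character class if the input can contain it.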

Giacomo1968
1

Documentation says:

The preg_split() function operates exactly like split(), except that regular expressions are accepted as input parameters for pattern.

So, the following code...

<?php

$ip = "123 ,456 ,789 ,000";
$iparr = preg_split("/[\s,]+/", $ip);
print "$iparr[0] <br />";
print "$iparr[1] <br />";
print "$iparr[2] <br />";
print "$iparr[3] <br />";

?>

This will produce the following result:

123
456
789
000 

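So, if you have this subject: is is and you want: array ( 0 => 'is', 1 => 'is', )

you need to modify your regex to "/[\s]+/"

If you have is ,is instead, you need the regex you already have: "/[\s,]+/"

Note that the subject in the question ends with a full stop, and neither pattern removes it. For that, the full stop has to be added to the character class (for example "/[\s.]+/") and the PREG_SPLIT_NO_EMPTY flag passed so that no empty element is left at the end.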

Giacomo1968
Federico Piazza
  • @FedericoPiazza In preg_split \s means space, then what about + – Gem Jul 17 '19 at 05:11
  • 1
    @Gem `\s` means any whitespace (including tabs); the `+` means _1 or more_. If you replace the + with *, it means _0 or more_. You can use regex101.com to see a detailed explanation of regexes. – Federico Piazza Jul 17 '19 at 15:10
1

PHP's str_word_count may be a better choice here.

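str_word_count($string, 2) will return an array of all words in the string, including duplicates, keyed by each word's position in the string. Use str_word_count($string, 1) if you want a plain list indexed 0, 1, 2, and so on.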
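A minimal sketch with the subject from the question (the variable names are illustrative):

```php
<?php
$subject = 'is is.';

// Format 1: a plain list of the words.
$list = str_word_count($subject, 1);
print_r($list);         // [0] => is, [1] => is

// Format 2: the same words, keyed by their offset in the string.
$byPosition = str_word_count($subject, 2);
print_r($byPosition);   // [0] => is, [3] => is
```

Format 1 matches the array the question asks for. Keep in mind that str_word_count has its own idea of a word (letters, plus the characters ' and -), so digits are not counted unless they are passed in the optional third argument.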

ceejayoz