7

How do I make this regex "not hungry"? preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui', $outStr, $matches);

tchrist
Arthur Kushman
  • 1
    I believe the general term is 'lazy' – josh.trow May 12 '11 at 13:02
  • 1
    @josh: Actually, it's "greedy". – Lightness Races in Orbit May 12 '11 at 13:07
  • 3
    Actually, terms like *greedy* and *lazy* are only colloquial shortcuts for longer and more technical terms, shortcuts which can sometimes mask what is really happening. The more technical terms are that quantifiers can match *maximally*, *minimally*, or *possessively*, where `*`, `+`, `?`, and `{n,m}` are the **maximal set**; `*?`, `+?`, `??`, and `{n,m}?` are the **minimal set**; and `*+`, `++`, and `{n,m}+` are the **possessive set**. Plus I suppose `?+` for completeness’ sake, but it doesn’t change what it does: think about it. – tchrist May 12 '11 at 13:15
  • 1
    @Tomalak: I believe I said it correctly - 'not hungry' == 'lazy'. I think you were thinking 'hungry' == 'greedy' – josh.trow May 12 '11 at 13:19

3 Answers

11

Do you mean non-greedy, as in find the shortest match instead of the longest? The *, +, and ? quantifiers are greedy by default and will match as much as possible. Add a question mark after them to make them non-greedy.

preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+?"/ui', $outStr, $matches);

Greedy match:

"foo" and "bar"
^^^^^^^^^^^^^^^

Non-greedy match:

"foo" and "bar"
^^^^^
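
The difference can be tried out with a small sketch. It is written in Python rather than PHP (the `+` versus `+?` behaviour is the same in both engines), and it uses an ASCII-only stand-in for the character class in the question, since Python's `re` module does not support `\p{L}`. Note that the range ` -_` includes the double quote itself, which is why the greedy version can run across the inner quotes at all.

```python
import re

text = '"foo" and "bar"'

# ASCII-only stand-in (an assumption) for the class in the question.
# The range " -_" (space through underscore) contains the double quote.
greedy = re.findall(r'"[a-z -_.+]+"', text)
lazy = re.findall(r'"[a-z -_.+]+?"', text)

print(greedy)  # ['"foo" and "bar"']
print(lazy)    # ['"foo"', '"bar"']
```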
John Kugelman
3

See: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

U (PCRE_UNGREEDY)

This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).

Treffynnon
Yoshi
  • 2
    Oops! That means PHP and Java use the `(?U)` flag differently. In PHP it turns on the `PCRE_UNGREEDY` regex compile flag, but in JDK7 it turns on `UNICODE_CHARACTER_CLASS` regex compile flag to make character classes conform to the [spec on Unicode regexes](http://www.unicode.org/reports/tr18/#Compatibility_Properties) — which is something PHP already does by default (I believe!), since Perl already does. Hm, reading the *pcrepattern* manpage leaves me mildly suspicious. It looks like it is only `[\pL\pN_]`, which isn’t quite what RL1.2 cited above wants. But it’s better than ASCII. – tchrist May 12 '11 at 13:10
  • 5
    generally a bad idea to use U (flip the quantifier behavior) unless you really know what you're doing. It is a lot more clear and gives you more control to flip it for each one individually using ? – Crayon Violent May 12 '11 at 13:11
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Modifiers". – aliteralmind Apr 10 '14 at 00:46
  • Note, the `(?U)` modifier is [_unique_](http://www.regular-expressions.info/modifiers.html) to PCRE (and derivatives like PHP and R) and won't be found in e.g. JavaScript, Python, or Perl. An earlier comment noted it behaves completely differently in Java. – Adam Katz Feb 09 '16 at 00:06
2

You suggested

/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui

which I submit is equivalent to:

/"[\pL\p{Nd}а-яА-ЯёЁ -_.+]+"/ui

To show people which non-ASCII characters you’re using, in case it is not obvious, here it is with \x{⋯} escapes:

/"[\pL\p{Nd}\x{430}-\x{44F}\x{410}-\x{42F}\x{451}\x{401} -_.+]+"/ui

And using named characters is:

/"[\pL\p{Nd}\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}\N{CYRILLIC SMALL LETTER IO}\N{CYRILLIC CAPITAL LETTER IO} -_.+]+"/ui

BTW, those are produced by running them through the uniquote script, the first using uniquote -x and the second using uniquote -v.

And yes, I know or at least believe that PHP doesn’t support named characters yet, but it makes it easier to talk about. Also, it makes sure they don't confuse the lookalikes:

U+0410 ‹А› \N{CYRILLIC CAPITAL LETTER A}
U+0430 ‹а› \N{CYRILLIC SMALL LETTER A}
U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}

for:

U+0041 ‹A› \N{LATIN CAPITAL LETTER A}
U+0061 ‹a› \N{LATIN SMALL LETTER A}
U+00CB ‹Ë› \N{LATIN CAPITAL LETTER E WITH DIAERESIS}
U+00EB ‹ë› \N{LATIN SMALL LETTER E WITH DIAERESIS}

And now that I think about it, those are all letters, so I cannot see why you are enumerating the Cyrillic list. Is it because you don’t want all Cyrillic letters, but rather just that particular set of them? Otherwise I would just do:

/"[\pL\p{Nd} -_.+]+"/ui

At which point I wonder about that /i. I can’t see what its purpose is, so would just write:

/"[\pL\p{Nd} -_.+]+"/u

As has been mentioned, swapping the maximally quantifying + for its corresponding minimal version, +?, will work:

/"[\pL\p{Nd} -_.+]+?"/u

However, I am concerned about that range of [ -_], that is, \N{SPACE}-\N{LOW LINE}. I find that a very peculiar range. It means the space itself plus any of these:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

For one thing, you’ve included the capital ASCII letters again. For another, you’ve omitted some symbols and punctuation characters:

% unichars -g '\p{ASCII}' '[\pS\pP]' 'ord() < ord(" ") || ord() > ord("_")'
 `  U+0060 GC=Sk GRAVE ACCENT
 {  U+007B GC=Ps LEFT CURLY BRACKET
 |  U+007C GC=Sm VERTICAL LINE
 }  U+007D GC=Pe RIGHT CURLY BRACKET
 ~  U+007E GC=Sm TILDE

(That output is from the unichars script, in case you’re curious.)
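
For readers without those scripts, the same two lists can be reproduced with a rough Python sketch using only the standard `unicodedata` module. Python is used here purely for illustration; it is not part of the original toolchain.

```python
import unicodedata

# Every code point in the class range [ -_], i.e. U+0020 through U+005F.
in_range = "".join(chr(cp) for cp in range(ord(" "), ord("_") + 1))

# ASCII symbols (S*) and punctuation (P*) that fall outside that range.
omitted = [
    chr(cp)
    for cp in range(0x80)
    if unicodedata.category(chr(cp))[0] in "SP"
    and not (ord(" ") <= cp <= ord("_"))
]

print(in_range)
print(omitted)  # ['`', '{', '|', '}', '~']
```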

That cutoff seems strangely arbitrary. So I’m wondering whether this might not be good enough for you:

/"[\pL\p{Nd}\s\pS\pP]+?"/u

Now that I think about it, these two might cause other problems:

U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}

That assumes those are in NFC form (formed by canonical composition of a canonical decomposition). If there were a chance that you are dealing with data that hasn’t been normalized to NFC form, then you would have to account for

NFD("\N{CYRILLIC CAPITAL LETTER IO}") => "\N{CYRILLIC CAPITAL LETTER IE}\N{COMBINING DIAERESIS}"
NFD("\N{CYRILLIC SMALL LETTER IO}")   => "\N{CYRILLIC SMALL LETTER IE}\N{COMBINING DIAERESIS}"

And now you have non-letters! The combining diaeresis is a Mark, not a Letter:

% uniprops "COMBINING DIAERESIS"
U+0308 ‹◌̈› \N{COMBINING DIAERESIS}
    \w \pM \p{Mn}
    All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Combining_Diacritical_Marks Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC

So maybe you would actually want:

/"[\pL\pM\p{Nd}\s\pS\pP]+?"/u
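
The decomposition is easy to check. Here is a small sketch using Python's standard `unicodedata` module (Python is an assumption of this example; the thread itself concerns PHP and Perl): it decomposes both IO letters to NFD and prints the name and general category of each resulting code point.

```python
import unicodedata

for ch in ("\u0401", "\u0451"):  # Ё and ё
    decomposed = unicodedata.normalize("NFD", ch)
    names = [unicodedata.name(c) for c in decomposed]
    cats = [unicodedata.category(c) for c in decomposed]
    print(f"U+{ord(ch):04X} -> {names} {cats}")

# Kept separately so the results are easy to inspect.
nfd_capital_io = unicodedata.normalize("NFD", "\u0401")
nfd_small_io = unicodedata.normalize("NFD", "\u0451")
```

Each letter becomes a base letter IE (category Lu or Ll) followed by U+0308, whose category is Mn, which is why \pM is needed in the class.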

If you wanted to restrict your string to containing only characters that are from the Latin or Cyrillic scripts (and not, say, Greek or Katakana), then you would add a lookahead to that effect:

/"(?:(?=[\p{Latin}\p{Cyrillic}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u

Except that you also need Common to get the digits and various punctuation and symbols, and you need Inherited for combining marks following your letters. That brings us up to this:

/"(?:(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u

That now suggests another way to effect a minimal match between the double quotes:

/"(?:(?!")(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+"/u

Which is getting way too complicated not to write in /x mode:

/
    "               # literal double quote
    (?:
  ### This group specifies a single char with
  ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )       # end constraint group
    +       # repeat entire group 1 or more times
    "       # and finally match another double-quote
/ux
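
The core trick, a negative lookahead that stops an otherwise maximal `+` at the next quote, can be tried in isolation. The sketch below is in Python with `re.VERBOSE` standing in for /x. Python's `re` module has no `\p{...}` script properties, so a plain `.` stands in for the script lookahead and the character class; that simplification is an assumption of this example, not part of the pattern above.

```python
import re

TEMPERED = re.compile(r'''
    "           # literal double quote
    (?:
        (?!")   # next char must NOT be a double quote
        .       # stand-in for the full Unicode character class
    )+          # maximal repeat, but it can never run past a quote
    "           # closing double quote
''', re.VERBOSE)

matches = TEMPERED.findall('"foo" and "bar"')
print(matches)  # ['"foo"', '"bar"']
```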

If it were Perl, I would write that with m{⋯}xu

m{
    "               # literal double quote
    (?:
  ### This group specifies a single char with
  ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )       # end constraint group
    +       # repeat entire group 1 or more times
    "       # and finally match another double-quote
}ux

But I do not know whether you can do paired, bracketing delimiters like that in PHP.

Hope this helps!

tchrist