How to use PHP preg_match_all to obtain the contents inside a custom tag (not HTML)

Question

I have to save user names and bios for a multilingual site. Because the number of languages will change over time, I'm trying to read them from an HTML textarea in the following format.

[lang:en]
Some content some content some content
some content some content
some content 
[endlang:en]

[lang:zh]
有些内容有些内容有些内容
一些内容有些内容
一些内容
[endlang:zh]

When the form is submitted, I want to obtain the content separated by language. I'm using preg_match_all:

$count = preg_match_all('|\[lang:([a-z]{2})\](.*)\[endlang:[a-z]{2}\]|si',$value,$matches);

But it doesn't catch anything. What should I do to fix this expression?

How are you checking if it matched? (You need to inspect the contents of `$matches` array, not `$count`.) — Amal Murali, Aug 09 '14 at 13:56
@AmalMurali Yes. I'm checking the matches array. $count is used just to find if there are any matches — EastSw, Aug 09 '14 at 13:58

score 3 · Accepted Answer · edited May 23 '17 at 12:20

Your regex is currently greedy: .* matches as much as it can, so it matches everything between the first [lang:xx] tag and the last [endlang:xx] tag. To fix this, make the quantifier lazy by adding a ? after the *, like so:

\[lang:([a-z]{2})\]\R*(.*?)\R*\[endlang:\1\]

Note that I've also used \R, which matches any newline sequence. Because \R* sits outside the capturing group, the newlines right after the opening tag and right before the closing tag are not included in the match results.

Additionally, in the original pattern the language code in the closing tag can differ from the one in the opening tag. I've used a backreference (\1) in the closing tag to prevent that, which makes the matching more robust.
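To illustrate the effect of the backreference, here is a small sketch. The input string and the variable names are made up for the illustration; the second block is deliberately closed with the wrong language code.

```php
<?php
// Hypothetical input: the [lang:zh] block is closed with [endlang:en].
$value = "[lang:en]\nHello\n[endlang:en]\n\n[lang:zh]\n你好\n[endlang:en]";

// Closing tag accepts any two-letter code.
$loose  = '|\[lang:([a-z]{2})\]\R*(.*?)\R*\[endlang:[a-z]{2}\]|si';
// Closing tag must repeat the code captured by group 1.
$strict = '|\[lang:([a-z]{2})\]\R*(.*?)\R*\[endlang:\1\]|si';

$looseCount  = preg_match_all($loose, $value, $looseMatches);
$strictCount = preg_match_all($strict, $value, $strictMatches);

echo "loose: $looseCount\n";   // 2, the mismatched block is accepted
echo "strict: $strictCount\n"; // 1, only the well-formed "en" block matches
```

With the backreference, a block whose closing tag does not match its opening tag is simply skipped rather than stored under the wrong language.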

Complete code:

$pattern = '|\[lang:([a-z]{2})\]\R*(.*?)\R*\[endlang:\1\]|si';

preg_match_all($pattern, $value, $matches);

// Combine the languages and matched strings to create an associative array
$result = array_combine($matches[1], $matches[2]);

var_dump($result);
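For reference, the snippet above could be wrapped into a small helper and run against sample input, roughly as follows. This is only a sketch: the function name parseLangBlocks and the sample text are hypothetical, and the u modifier is an addition that is not part of the answer. It assumes the textarea content is valid UTF-8, so that \R and . work on characters rather than bytes (in byte mode \R can also match the byte 0x85, which occurs inside some multi-byte UTF-8 characters).

```php
<?php
/**
 * Hypothetical helper: split "[lang:xx] ... [endlang:xx]" blocks into
 * an array keyed by language code.
 */
function parseLangBlocks(string $value): array
{
    // Same pattern as the answer, plus the "u" modifier (an assumption).
    $pattern = '|\[lang:([a-z]{2})\]\R*(.*?)\R*\[endlang:\1\]|siu';

    // preg_match_all() returns the number of full matches, or false on
    // error (for example, when the subject is not valid UTF-8).
    $count = preg_match_all($pattern, $value, $matches);
    if ($count === false || $count === 0) {
        return [];
    }

    // $matches[1] holds the language codes, $matches[2] the block contents.
    return array_combine($matches[1], $matches[2]);
}

$value = "[lang:en]\nSome content\nsome content\n[endlang:en]\n\n"
       . "[lang:zh]\n有些内容\n一些内容\n[endlang:zh]";

$result = parseLangBlocks($value);
print_r($result);
```

Note that if the same language code appears twice in the input, array_combine keeps only the last block for that code.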


@priyan99 You might also want to require that the closing tag repeats the code captured in group 1, i.e. `endlang:\1` — Jonny 5, Aug 09 '14 at 14:23
@Jonny5: Yeah, that's a good idea. I've edited the answer to reflect this. Thanks! — Amal Murali, Aug 09 '14 at 14:28

score 1 · Answer 2 · answered Aug 09 '14 at 14:06

Regex quantifiers in PHP are greedy by default, so your version matches from the first opening tag to the very last closing tag. You can request non-greedy behavior by adding ? after the quantifier, like so:

$count = preg_match_all('|\[lang:([a-z]{2})\](.*?)\[endlang:[a-z]{2}\]|si',$value,$matches);

That makes the expression match as few characters as possible between the tags. I have just tested it and it seems to work.
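The difference between the two quantifiers can be seen with a short comparison. The two-block input below is made up for the illustration, and the variable names are arbitrary.

```php
<?php
// Hypothetical two-block input in the question's format.
$value = "[lang:en]\nHello\n[endlang:en]\n\n[lang:zh]\n你好\n[endlang:zh]";

// Greedy: .* runs to the last closing tag in the input.
$greedy = '|\[lang:([a-z]{2})\](.*)\[endlang:[a-z]{2}\]|si';
// Lazy: .*? stops at the first closing tag it can reach.
$lazy   = '|\[lang:([a-z]{2})\](.*?)\[endlang:[a-z]{2}\]|si';

$greedyCount = preg_match_all($greedy, $value, $greedyMatches);
$lazyCount   = preg_match_all($lazy, $value, $lazyMatches);

echo "greedy: $greedyCount match(es)\n"; // 1, both blocks swallowed as one
echo "lazy: $lazyCount match(es)\n";     // 2, one match per block
```

In the greedy case the single captured group contains the first closing tag and the second opening tag as well. In the lazy case each captured group still includes the newlines next to the tags, since this pattern does not strip them.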
