Get everything between <code> and </code> with PHP

Question

I'm trying to grab a string within a string using regex.

I've looked and looked but I can't seem to get any of the examples I have to work.

I need to grab the html tags <code> and </code> and everything in between them.

Then I need to pull the matched string from the parent string, do operations on both,

then put the matched string back into the parent string.

Here's my code:

$content = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. &lt;code>Donec sed erat vel diam ultricies commodo. Nunc venenatis tellus eu quam suscipit quis fermentum dolor vehicula.&lt;/code>"
$regex='';
$code = preg_match($regex, $text, $matches);

I've already tried these without success:

$regex = "/<code\s*(.*)\>(.*)<\/code>/";
$regex = "/<code>(.*)<\/code>/";

Tony the pony, he comes... [Have you tried an HTML parser?](http://stackoverflow.com/a/1732454/707111) — Ry-, Feb 12 '12 at 21:51
Not really, this is no different to parsing a single BBcode tag. There are no attributes on it, just a straight `(.*)` — Joe, Feb 12 '12 at 21:52
@minitech To parse a single BBcode tag? Sounds like a perfect situation for regex, no point getting Pear involved for something so simple. — Joe, Feb 12 '12 at 21:55
@Joe: What if the single BBCode tag is surrounded in `[NOPARSE]`? What then? Malformed attributes? Custom extensions? AAGH! — Ry-, Feb 12 '12 at 22:00
to summarize, usually you wouldn't want to use regex to parse html as it has a complicated nested structure which regex was not built for parsing, but if it's just a one off with source that you can predict will be what you expect it to be, go for it. — dqhendricks, Feb 12 '12 at 22:01
I'm just taking a wild guess here, but given what he's asking, I don't think you need to worry about `[NOPARSE]` or anything silly: it's just "how do I match anything between THIS string literal and THAT string literal", where the string literals just happen to be XML tags. There's no additional variation possible on the string literals, so no point over-complicating it. — Joe, Feb 12 '12 at 22:02
Why did you post the EXACT same question just 30minutes earlier? — Peter Svensson, Feb 12 '12 at 22:12
This is a duplicate of your previous question http://stackoverflow.com/questions/9252793/regex-to-match-a-string-in-a-string — Sam Greenhalgh, Feb 12 '12 at 22:35
That one didn't work, which is this: $var = 'textwordasdf'; and this: $regex='#(.*?)#'; $code = preg_match($regex, $text, $matches); print $code; if ($code == 1) { — Nate, Feb 12 '12 at 22:40
I posted the same thing twice in a short time because the older post had a lot of views and no answers (more so than most answered questions). But you're right, next time I'll be more patient, change the example, or use a combination of forums. — Nate, Feb 12 '12 at 22:46
Can a parser be installed in and used by a website? I'm looking to do this inline with a webpage. I'm not trying to code here, I'm trying to format database content and display it on a webpage. — Nate, Feb 12 '12 at 22:56
Please explain where you got the variable `$text` from. Your code is INVALID!! And no one saw that LOL :D — Cyborg, Jun 11 '17 at 20:50

score 35 · Answer 1 · answered Feb 17 '12 at 01:23

You can use the following:

$regex = '#<\s*?code\b[^>]*>(.*?)</code\b[^>]*>#s';

\b (a word boundary) ensures that a similarly named tag (like <codeS>) is not matched.
The first [^>]* matches any attributes on the opening tag (e.g. a class).
Finally, the s flag lets . match newlines, so content spanning several lines is captured.

See the result here : http://lumadis.be/regex/test_regex.php?id=1081
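As a quick check of the three points above, here is a minimal sketch that runs the pattern against a made-up sample string (the `$html` value is an assumption for illustration, not taken from the question). It uses `preg_match_all()` so that every match is counted.

```php
<?php
$regex = '#<\s*?code\b[^>]*>(.*?)</code\b[^>]*>#s';

// Illustrative input: one <code> element with an attribute and a newline,
// followed by a look-alike <codeS> element that should be ignored.
$html = "<p>intro</p>\n"
      . "<code class=\"php\">line one\nline two</code>\n"
      . "<codeS>not a code tag</codeS>";

$count = preg_match_all($regex, $html, $matches);

// $matches[0] holds the full elements, $matches[1] the inner content.
echo $count, "\n";
echo $matches[1][0], "\n";
```

Only the real `<code>` element should match, and its captured content should keep the embedded newline.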

piouPiouM


It doesn't work for tags that have no end tag, though. For example, it doesn't work on ``. – John Slegers Mar 14 '14 at 17:39
This is completely natural because the question is to capture the contents of `...` tags and not to capture a self-closing tag (which does not concern the `` tag). – piouPiouM Mar 14 '14 at 23:40
Instead of selecting the whole text between the tags, how do I select only certain characters? – jay Apr 18 '18 at 12:57

score 27 · Answer 2 · answered Apr 06 '14 at 08:20

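This function worked for me:

<?php

// Returns the content between the first <$tagname ...> and its closing tag.
function everything_in_tags($string, $tagname)
{
    // \b stops e.g. "p" from matching <pre>; the s flag lets . span newlines
    $pattern = "#<\s*?$tagname\b[^>]*>(.*?)</$tagname\b[^>]*>#s";
    preg_match($pattern, $string, $matches);
    return $matches[1]; // capture group 1 = inner content
}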

?>

moni as


Can you explain why you're returning $matches[1] and not [0]? – ina Jan 16 '17 at 08:20
$matches[0] is the outer HTML of the matched element (tags included). To get the inner HTML, use $matches[1]. – moni as Jan 17 '17 at 08:49
For safety, change the return to `return isset($matches[1]) ? $matches[1] : false;`. Otherwise, if the tag is not present, accessing `$matches[1]` gives an error. – Yatin Mistry Mar 25 '19 at 08:21
This works perfectly and should be the accepted answer, thanks! – SJTriggs Nov 26 '19 at 06:07
It works with the a tag, but it does not work with the p tag. – Võ Minh Jun 19 '20 at 09:02
Tested with "Test P" and "content between p Tag" inside p tags, and it worked like a charm. Can you send your code so I can check? – moni as Jun 20 '20 at 07:48
perfect solution sir :). – Omda Jun 24 '20 at 16:39
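Following the `isset` suggestion in the comments, a hardened variant might look like the sketch below. The name `everything_in_tags_safe` is hypothetical, returning `null` on a miss is just one possible convention, and the `?string` return type assumes PHP 7.1 or later. It also runs the tag name through `preg_quote()` so that special characters in it cannot alter the pattern.

```php
<?php
// Hypothetical variant of everything_in_tags(); not part of the original answer.
function everything_in_tags_safe(string $string, string $tagname): ?string
{
    // Escape regex metacharacters (and the # delimiter) in the tag name.
    $tag = preg_quote($tagname, '#');
    $pattern = '#<\s*?' . $tag . '\b[^>]*>(.*?)</' . $tag . '\b[^>]*>#s';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $string, $matches) === 1) {
        return $matches[1];
    }
    return null; // tag not found (or the pattern failed)
}

var_dump(everything_in_tags_safe('<p>Test P</p>', 'p'));
var_dump(everything_in_tags_safe('no tags here', 'p'));
```

Callers can then test the result with `=== null` instead of relying on an undefined array index.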

score 24 · Answer 3 · answered Feb 12 '12 at 21:53

$regex = '#<code>(.*?)</code>#';

This uses # as the delimiter instead of /, so the / in </code> doesn't need to be escaped.

As Phoenix posted below, .*? is a "non-greedy quantifier": it makes .* ("anything") match as few characters as possible before it reaches a </code>. That way, if your string is

<code>hello</code> something <code>again</code>

you'll match hello and again instead of just matching hello</code> something <code>again.
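The difference can be seen side by side in this small sketch, which runs the non-greedy and the greedy pattern over the example string above. `preg_match_all()` is used because `preg_match()` stops after the first match; the variable names are only for illustration.

```php
<?php
$html = '<code>hello</code> something <code>again</code>';

// Non-greedy: each match stops at the nearest </code>.
preg_match_all('#<code>(.*?)</code>#', $html, $lazy);

// Greedy: the single match runs to the last </code> in the string.
preg_match_all('#<code>(.*)</code>#', $html, $greedy);

print_r($lazy[1]);
print_r($greedy[1]);
```

Note that PHP's preg functions have no `g` modifier; collecting every match is done by calling `preg_match_all()` rather than by adding a flag.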

edited Feb 12 '12 at 21:59

answered Feb 12 '12 at 21:53

Joe


Should it be `(.*?)` in case the string contains multiple `` tags (acknowledging that the example in the OP did not indicate this)? – Feb 12 '12 at 21:57
i like the greedy option – Alberto Feb 13 '12 at 00:47
I know this is way old... but... be very careful with the regex posted above. I have used this type of regex before on very large XML documents. Turn off the greedy setting, or you can wind up with catastrophic backtracking. – photocode Oct 20 '16 at 09:27
@Joe Why doesn't it work for string like `@github,@gcal-work,` (note the comma at the end)?......It is only picking the first tag...i.e @github. Any idea? – Khurshid Alam Nov 07 '16 at 15:32
@KhurshidAlam you probably need all matches rather than just the first one. PHP's preg functions have no `g` flag; call `preg_match_all()` instead of `preg_match()` to get every match. – Joe Nov 07 '16 at 15:35
No problem - if the answer helped, would you mind upvoting it? Rep adds up over time :-) – Joe Nov 07 '16 at 16:15

score 7 · Answer 4 · answered Feb 13 '12 at 00:45

You can use /<code>([\s\S]*)<\/code>/msU. This catches NEWLINES too!
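One detail worth checking here is the `U` modifier: in PCRE it inverts greediness, so `*` becomes non-greedy and `*?` becomes greedy. The sketch below, on a made-up input string, shows the effect of both forms under `msU`.

```php
<?php
$html = "<code>a\nb</code> x <code>c</code>";

// With U, a plain * is non-greedy: the match stops at the first </code>.
preg_match('/<code>([\s\S]*)<\/code>/msU', $html, $plain);

// With U, *? flips back to greedy: the match runs to the last </code>.
preg_match('/<code>([\s\S]*?)<\/code>/msU', $html, $flipped);

var_dump($plain[1]);
var_dump($flipped[1]);
```

So with `U` present, adding a question mark after the `*` makes the match longer rather than shorter.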

Alberto


if you need a non-greedy option just put a question mark after the * '/([\s\S]*?)/msU' – Alberto Feb 13 '12 at 00:48
I thought you liked the greedy option, stop touting non-greedy too :P – Joe Feb 14 '12 at 07:26

score 1 · Answer 5 · answered Feb 23 '12 at 17:09

function contentDisplay($text)
{
    //replace UTF-8
    $convertUT8 = array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
    $to = array("'", "'", '"', '"', '-', '--', '...');
    $text = str_replace($convertUT8,$to,$text);

    //replace Windows-1252
    $convertWin1252 = array(chr(145), chr(146), chr(147), chr(148), chr(150), chr(151), chr(133));
    $to = array("'", "'", '"', '"', '-', '--', '...');
    $text = str_replace($convertWin1252,$to,$text);

    //replace accents
    $convertAccents = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'Ð', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', '?', '?', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', '?', '?', 'L', 'l', 'N', 'n', 'N', 'n', 'N', 'n', '?', 'O', 'o', 'O', 'o', 'O', 'o', 'Œ', 'œ', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'Š', 'š', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Ÿ', 'Z', 'z', 'Z', 'z', 'Ž', 'ž', '?', 'ƒ', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', '?', '?', '?', '?', '?', '?');
    $to = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
    $text = str_replace($convertAccents,$to,$text);

    //Encode the characters
    $text = htmlentities($text);

    //normalize the line breaks (here because it applies to all text)
    $text = str_replace("\r\n", "\n", $text);
    $text = str_replace("\r", "\n", $text);

    //decode the <code> tags that htmlentities() encoded above
    //strpos() returns 0 for a match at the start of the string, so compare against false explicitly
    $codeOpen = htmlentities('<').'code'.htmlentities('>');
    if (strpos($text, $codeOpen) !== false)
    {
        $text = str_replace($codeOpen, '<code>', $text);
    }
    $codeClose = htmlentities('<').'/code'.htmlentities('>');
    if (strpos($text, $codeClose) !== false)
    {
        $text = str_replace($codeClose, '</code>', $text);
    }

    //match everything between <code> and </code>, the msU is what makes this work here, ADD this to REGEX archive
    $regex = '/<code>(.*)<\/code>/msU';
    $code = preg_match($regex, $text, $matches);
    if ($code == 1)
    {
        if (is_array($matches) && count($matches) >= 2)
        {
            $newcode = $matches[1];

            $newcode = nl2br($newcode);
        }

        //remove <code>and this</code> from $text;
        $text = str_replace('<code>' . $matches[1] . '</code>', 'PLACEHOLDERCODE1', $text);

        //convert the line breaks to paragraphs
        $text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
        $text = str_replace("\n" , '<br />', $text);
        $text = str_replace('</p><p>', '</p>' . "\n\n" . '<p>', $text);

        $text = str_replace('PLACEHOLDERCODE1', '<code>'.$newcode.'</code>', $text);
    }
    else
    {
        $code = false;
    }

    if ($code == false)
    {
        //convert the line breaks to paragraphs
        $text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
        $text = str_replace("\n" , '<br />', $text);
        $text = str_replace('</p><p>', '</p>' . "\n\n" . '<p>', $text);
    }

    return $text;
}

This whopping piece of code should contain some explanation. — Andrea Lazzarotto, Jan 14 '17 at 23:59
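The core idea in the function above (pull the `<code>` block out, process the block and the surrounding text differently, then put the block back) can be sketched more compactly. This is not the answer author's method: `format_with_code` is a hypothetical helper, and the two processing steps (escaping plus `nl2br()` outside the block, escaping only inside it) are assumptions chosen to keep the example short. It relies on `preg_split()` with `PREG_SPLIT_DELIM_CAPTURE`, which keeps the matched blocks in the result instead of a placeholder string.

```php
<?php
// Hypothetical helper; a simplified take on the placeholder approach above.
function format_with_code(string $text): string
{
    // The capture group makes preg_split() return the <code> blocks too,
    // so the result alternates: text, code block, text, code block, ...
    $parts = preg_split('#(<code>.*?</code>)#s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes are code blocks: strip the tags, escape, re-wrap.
            $inner = substr($part, strlen('<code>'), -strlen('</code>'));
            $out .= '<code>' . htmlspecialchars($inner) . '</code>';
        } else {
            // Even indexes are ordinary text: escape and add <br /> tags.
            $out .= nl2br(htmlspecialchars($part));
        }
    }
    return $out;
}

echo format_with_code("a\nb <code>x < y\nz</code> c");
```

Because the blocks never leave the array, this also handles several `<code>` blocks in one string without numbering placeholders.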

score 0 · Answer 6 · answered Feb 23 '19 at 07:21

You can also try:

function getTagValue($string, $tag)
{
    $pattern = "/<{$tag}>(.*?)<\/{$tag}>/s";
    preg_match($pattern, $string, $matches);
    return isset($matches[1]) ? $matches[1] : '';
}

It returns an empty string when there is no match.

Milind Singh


score -1 · Answer 7 · answered Jun 14 '20 at 07:57

To retrieve or delete the content of a script tag, even with special cases like <script async>.

$str = '
Some js embed
<script async>
  alert("js")
  let job, origin = new Date().getTime()
</script>
<span id="OUT"></span>
<button onclick="alert()">RESET</button>
timer experiment
';

$reg = '/<script([\s\S]*)<\/script>/';

preg_match($reg, $str, $matches);
$match = substr($matches[0], (strpos($matches[0], ">")+1));
$match = str_replace("</script>", "", $match);
echo $match;
/* OUTPUT
  alert("js")
  let job, origin = new Date().getTime()
*/
echo "\n---------------------\n";
echo preg_replace($reg, "DELETED", $str);
/* OUTPUT
  Some js embed
  DELETED
  <span id="OUT"></span>
  <button onclick="alert()">RESET</button>
  timer experiment
*/

This is not answering the OP's specific question. This answer belongs on a different question. Furthermore, a DOM parser should be used, not regex, when interrogating an html document. — mickmackusa, Jun 14 '20 at 08:00
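For comparison, here is a minimal sketch of the DOM parser route mentioned in the comments, using PHP's bundled `DOMDocument` class (this assumes the dom extension is enabled, which it usually is). The sample `$html` string and the wrapping `<div>` are assumptions for illustration, and input containing non-ASCII characters would need extra care with encoding.

```php
<?php
// Illustrative input with two <code> elements, one carrying an attribute.
$html = 'Lorem ipsum. <code class="php">echo "hi";</code> More text. <code>second</code>';

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Wrap the fragment in a <div> so it has a single container element.
$doc->loadHTML('<div>' . $html . '</div>');
libxml_clear_errors();

$blocks = [];
foreach ($doc->getElementsByTagName('code') as $node) {
    // textContent is the text inside the element, without the tags.
    $blocks[] = $node->textContent;
}

print_r($blocks);
```

Since `DOMDocument` ships with PHP, this runs inside an ordinary web page script with no separate installation step.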
