31

I'm trying to grab a string within a string using regex.

I've looked and looked but I can't seem to get any of the examples I have to work.

I need to grab the html tags <code> and </code> and everything in between them.

Then I need to pull the matched string out of the parent string, do operations on both, and then put the matched string back into the parent string.

Here's my code:

$content = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. <code>Donec sed erat vel diam ultricies commodo. Nunc venenatis tellus eu quam suscipit quis fermentum dolor vehicula.</code>";
$regex='';
$code = preg_match($regex, $text, $matches);

I've already tried these without success:

$regex = "/<code\s*(.*)\>(.*)<\/code>/";
$regex = "/<code>(.*)<\/code>/";
Cœur
Nate
  • 2
    Tony the pony, he comes... [Have you tried an HTML parser?](http://stackoverflow.com/a/1732454/707111) – Ry- Feb 12 '12 at 21:51
  • 1
    Not really, this is no different to parsing a single BBcode tag. There are no attributes on it, just a straight `(.*)` – Joe Feb 12 '12 at 21:52
  • 1
    @minitech To parse a single BBcode tag? Sounds like a perfect situation for regex, no point getting Pear involved for something so simple. – Joe Feb 12 '12 at 21:55
  • @Joe: What if the single BBCode tag is surrounded in `[NOPARSE]`? What then? Malformed attributes? Custom extensions? AAGH! – Ry- Feb 12 '12 at 22:00
  • to summarize, usually you wouldn't want to use regex to parse html as it has a complicated nested structure which regex was not built for parsing, but if it's just a one off with source that you can predict will be what you expect it to be, go for it. – dqhendricks Feb 12 '12 at 22:01
  • 2
    I'm just taking a wild guess here, but given what he's asking, I don't think you need to worry about `[NOPARSE]` or anything silly: it's just "how do I match anything between THIS string literal and THAT string literal", where the string literals just happen to be XML tags. There's no additional variation possible on the string literals, so no point over-complicating it. – Joe Feb 12 '12 at 22:02
  • 1
    Why did you post the EXACT same question just 30 minutes earlier? – Peter Svensson Feb 12 '12 at 22:12
  • This is a duplicate of your previous question http://stackoverflow.com/questions/9252793/regex-to-match-a-string-in-a-string – Sam Greenhalgh Feb 12 '12 at 22:35
  • that one didn't work, which is this: $var = 'textwordasdf'; and this: $regex='#(.*?)#'; $code = preg_match($regex, $text, $matches); print $code; if ($code == 1) { – Nate Feb 12 '12 at 22:40
  • I posted the same thing twice in a short time because the older post had a lot of views and no answers (more so than most answered questions). But you're right, next time I'll be more patient, change the example, or use a combination of forums. – Nate Feb 12 '12 at 22:46
  • Can a parser be installed in and used by a website? I'm looking to do this inline with a webpage. I'm not trying to code here, I'm trying to format database content and display it on a webpage – Nate Feb 12 '12 at 22:56
  • Please explain where you got the variable `$text` from. Your code is INVALID!! And no one saw that LOL :D – Cyborg Jun 11 '17 at 20:50

7 Answers

35

You can use the following:

$regex = '#<\s*?code\b[^>]*>(.*?)</code\b[^>]*>#s';
  • \b is a word boundary, so a longer tag name (a typo like <codeS>) is not matched.
  • The [^>]* after the tag name matches any attributes on the tag (e.g. a class); it does not capture them.
  • Finally, the s flag makes . match newlines too, so content that spans several lines is captured.

See the result here : http://lumadis.be/regex/test_regex.php?id=1081
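As a rough illustration of the three points above, here is a minimal sketch that runs the pattern against a made-up input. The sample string and the `$html`, `$count` names are assumptions for this example, not part of the original answer; `preg_match_all` is used so that every block would be collected, not only the first.

```php
<?php
// Pattern from the answer: optional attributes, word boundary, dot matches newlines.
$regex = '#<\s*?code\b[^>]*>(.*?)</code\b[^>]*>#s';

// Hypothetical sample input, made up for this illustration.
$html = "<code class=\"php\">echo 1;\necho 2;</code> and <codeS>not this</codeS>";

// preg_match_all() returns the number of full matches found.
$count = preg_match_all($regex, $html, $matches);

// $matches[1] holds the text captured by the first group for every match.
// Here it contains only the content of the real <code> block, newline included;
// <codeS> is skipped because \b fails between "e" and "S".
print_r($matches[1]);
```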

piouPiouM
  • It doesn't work for tags that have no end tag, though. For example, it doesn't work on ``. – John Slegers Mar 14 '14 at 17:39
  • 4
    This is completely natural because the question is to capture the contents of `<code>...</code>` tags and not to capture a self-closing tag (which does not concern the `<code>` tag). – piouPiouM Mar 14 '14 at 23:40
  • Instead of selecting the whole text between the tags, how can I select only certain characters? – jay Apr 18 '18 at 12:57
27

This function worked for me:

<?php

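function everything_in_tags($string, $tagname)
{
    // escape the tag name so it is treated literally inside the pattern
    $tagname = preg_quote($tagname, '#');
    $pattern = "#<\s*?$tagname\b[^>]*>(.*?)</$tagname\b[^>]*>#s";
    // return null instead of raising an undefined-index notice when nothing matches
    return preg_match($pattern, $string, $matches) ? $matches[1] : null;
}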

?>
moni as
24
$regex = '#<code>(.*?)</code>#';

This uses # as the delimiter instead of /, so the / in </code> doesn't need to be escaped.

As Phoenix posted below, .*? is used to make the .* ("anything") match as few characters as possible before it comes across a </code> (known as a "non-greedy quantifier"). That way, if your string is

<code>hello</code> something <code>again</code>

you'll match hello and again instead of just matching hello</code> something <code>again.
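The difference can be checked with a short sketch along these lines. The variable names `$greedy` and `$lazy` are only illustrative; `preg_match_all` is used for the non-greedy case because `preg_match` stops after the first match.

```php
<?php
$text = '<code>hello</code> something <code>again</code>';

// Greedy: .* runs on to the last </code> in the string.
preg_match('#<code>(.*)</code>#', $text, $greedy);

// Non-greedy: .*? stops at the first </code>; preg_match_all collects every match.
preg_match_all('#<code>(.*?)</code>#', $text, $lazy);

echo $greedy[1], "\n";               // hello</code> something <code>again
echo implode(', ', $lazy[1]), "\n";  // hello, again
```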

Joe
  • 7
    Should it be `(.*?)` in case the string contains multiple `<code>` tags (acknowledging that the example in the OP did not indicate this)? –  Feb 12 '12 at 21:57
  • i like the greedy option – Alberto Feb 13 '12 at 00:47
  • I know this is way old... but... be very careful with the regex posted above. I have used this type of regex before on very large XML documents. Turn off the greedy setting, or you can wind up with catastrophic backtracking. – photocode Oct 20 '16 at 09:27
  • @Joe Why doesn't it work for string like `@github,@gcal-work,` (note the comma at the end)?......It is only picking the first tag...i.e @github. Any idea? – Khurshid Alam Nov 07 '16 at 15:32
  • @KhurshidAlam you probably need to add the `g` flag to make the regex "global" - ie, make it return all matches rather than just the first one. Simply add a lowercase letter `g` after the ending delimiter, eg. `#(.*?)#g` – Joe Nov 07 '16 at 15:35
  • No problem - if the answer helped, would you mind upvoting it? Rep adds up over time :-) – Joe Nov 07 '16 at 16:15
7

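You can use /<code>([\s\S]*)<\/code>/msU, which catches newlines too. [\s\S] matches any character including a newline, and the U modifier makes quantifiers ungreedy by default, so the match stops at the first </code>.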
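A small sketch of what the U modifier changes, using a made-up two-block input (the sample string and variable names are assumptions for illustration only):

```php
<?php
// Hypothetical input with a newline inside the first block.
$text = "<code>one\nline</code> text <code>two</code>";

// With U, * is ungreedy by default, so each block is matched separately.
preg_match_all('/<code>([\s\S]*)<\/code>/U', $text, $ungreedy);

// Without U the same pattern is greedy and runs on to the last </code>.
preg_match_all('/<code>([\s\S]*)<\/code>/', $text, $greedyMatches);

print_r($ungreedy[1]);       // two entries, one per block
print_r($greedyMatches[1]);  // a single entry spanning both blocks
```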

Alberto
1
function contentDisplay($text)
{
    //replace UTF-8
    $convertUT8 = array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
    $to = array("'", "'", '"', '"', '-', '--', '...');
    $text = str_replace($convertUT8,$to,$text);

    //replace Windows-1252
    $convertWin1252 = array(chr(145), chr(146), chr(147), chr(148), chr(150), chr(151), chr(133));
    $to = array("'", "'", '"', '"', '-', '--', '...');
    $text = str_replace($convertWin1252,$to,$text);

    //replace accents
    $convertAccents = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'Ð', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', '?', '?', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', '?', '?', 'L', 'l', 'N', 'n', 'N', 'n', 'N', 'n', '?', 'O', 'o', 'O', 'o', 'O', 'o', 'Œ', 'œ', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'Š', 'š', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Ÿ', 'Z', 'z', 'Z', 'z', 'Ž', 'ž', '?', 'ƒ', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', '?', '?', '?', '?', '?', '?');
    $to = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
    $text = str_replace($convertAccents,$to,$text);

    //Encode the characters
    $text = htmlentities($text);

    //normalize the line breaks (here because it applies to all text)
    $text = str_replace("\r\n", "\n", $text);
    $text = str_replace("\r", "\n", $text);

    //decode the <code> tags (htmlentities() above turned them into &lt;code&gt; and &lt;/code&gt;)
    //str_replace is a no-op when there is nothing to replace, so no strpos() check is needed
    //(the old check also failed when the tag sat at position 0)
    $text = str_replace(htmlentities('<code>'), '<code>', $text);
    $text = str_replace(htmlentities('</code>'), '</code>', $text);

    //match everything between <code> and </code>; s lets . match newlines and U makes .* ungreedy
    $regex = '/<code>(.*)<\/code>/msU';
    $newcode = null;
    if (preg_match($regex, $text, $matches) === 1)
    {
        $newcode = nl2br($matches[1]);

        //swap <code>and this</code> for a placeholder so the paragraph conversion leaves it alone
        $text = str_replace($matches[0], 'PLACEHOLDERCODE1', $text);
    }

    //convert the line breaks to paragraphs
    $text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
    $text = str_replace("\n" , '<br />', $text);
    $text = str_replace('</p><p>', '</p>' . "\n\n" . '<p>', $text);

    //put the <code> block back in place of the placeholder
    if ($newcode !== null)
    {
        $text = str_replace('PLACEHOLDERCODE1', '<code>'.$newcode.'</code>', $text);
    }

    return $text;
}
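The function above handles only the first <code> block. As a hedged sketch of one way to extend the same "pull out, process, put back" idea to any number of blocks, `preg_split` with `PREG_SPLIT_DELIM_CAPTURE` can be used instead of a placeholder. The helper name `transformAroundCode` and the two stand-in callbacks are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): applies one callback to
// the text outside <code>...</code> and another to each <code> block, then rejoins.
function transformAroundCode(string $text, callable $outside, callable $inside): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured delimiters (the code blocks)
    // are returned as well, so pieces alternate: outside, code, outside, code, ...
    $pieces = preg_split('#(<code>.*?</code>)#s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($pieces as $i => $piece) {
        // Odd indexes are the captured <code> blocks, even indexes the surrounding text.
        $pieces[$i] = ($i % 2 === 1) ? $inside($piece) : $outside($piece);
    }

    return implode('', $pieces);
}

$result = transformAroundCode(
    "intro\n<code>a\nb</code>\noutro",
    'strtoupper', // stand-in operation for the surrounding text
    'nl2br'       // stand-in operation for each code block
);
echo $result;
```

Because the code blocks are never replaced by a marker string, there is no risk of the placeholder text colliding with real content.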
Nate
0

You can also try:

function getTagValue($string, $tag)
{
    $pattern = "/<{$tag}>(.*?)<\/{$tag}>/s";
    preg_match($pattern, $string, $matches);
    return isset($matches[1]) ? $matches[1] : '';
}

It returns an empty string if there is no match.
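If the tag name comes from a variable, it may be worth escaping it, and if several tags can occur, collecting all of them. The following is only a sketch of that idea; `getAllTagValues` is a hypothetical variant, not part of the answer above.

```php
<?php
// Hypothetical variant of getTagValue(): returns every match instead of only the first.
function getAllTagValues(string $string, string $tag): array
{
    // preg_quote() escapes regex metacharacters (and the "/" delimiter) in $tag.
    $quoted = preg_quote($tag, '/');
    preg_match_all("/<{$quoted}>(.*?)<\/{$quoted}>/s", $string, $matches);

    // $matches[1] is an empty array when nothing matched.
    return $matches[1];
}

$values = getAllTagValues('<b>one</b> and <b>two</b>', 'b');
print_r($values);
```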

Milind Singh
-1

This retrieves or deletes the content of a script tag, including tags with attributes such as <script async>:

$str = '
Some js embed
<script async>
  alert("js")
  let job, origin = new Date().getTime()
</script>
<span id="OUT"></span>
<button onclick="alert()">RESET</button>
timer experiment
';

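// non-greedy, so a page with several script tags is not swallowed from the first <script to the last </script>
$reg = '/<script([\s\S]*?)<\/script>/';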

preg_match($reg, $str, $matches);
$match = substr($matches[0], (strpos($matches[0], ">")+1));
$match = str_replace("</script>", "", $match);
echo $match;
/* OUTPUT
  alert("js")
  let job, origin = new Date().getTime()
*/
echo "\n---------------------\n";
echo preg_replace($reg, "DELETED", $str);
/* OUTPUT
  Some js embed
  DELETED
  <span id="OUT"></span>
  <button onclick="alert()">RESET</button>
  timer experiment
*/
NVRM
  • This is not answering the OP's specific question. This answer belongs on a different question. Furthermore, a DOM parser should be used, not regex, when interrogating an html document. – mickmackusa Jun 14 '20 at 08:00
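For comparison, here is a minimal sketch of the DOM-parser route mentioned in the comments, using PHP's built-in `DOMDocument`. It assumes the DOM extension (ext-dom) is enabled, and the sample string is a shortened, made-up version of the text in the question.

```php
<?php
// Assumes PHP's DOM extension (ext-dom) is available.
$html = 'Lorem ipsum. <code>Donec sed erat</code> dolor <code>vehicula</code>';

$doc = new DOMDocument();

// Collect libxml warnings internally instead of emitting them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the text content of every <code> element in document order.
$contents = [];
foreach ($doc->getElementsByTagName('code') as $node) {
    $contents[] = $node->textContent;
}
print_r($contents);
```

A parser copes with attributes, nesting and odd whitespace without any pattern tuning, at the cost of more setup than a one-line regex.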