I need to extract the date and time with a regular expression, but it doesn't work and I don't know why.

    <tr>
        <td align="center">13.44.333-3</td>
        <td align="center">asdf3</td>
        <td align="center">15/01/2016 00:22:16</td>
        <td align="center">$ 1531</td>
    </tr>
 <tr>
        <td align="center">13.333.333-3</td>
        <td align="center">asdf3</td>
        <td align="center">16/01/2016 00:22:16</td>
        <td align="center">$ 1531</td>
    </tr>
 <tr>
        <td align="center">13.333.333-3</td>
        <td align="center">asdf3</td>
        <td align="center">11/01/2015 00:22:16</td>
        <td align="center">$ 1531</td>
    </tr>

The regular expression I use:

preg_match_all("/<td align=\"center\"\>[\s]*([^\s\<\/]*)<\/td>[\s]*<td align=\"center\"\>/is",$content, $matches, null, 0);

The result is: 11/01/2016

But I need this: 11/01/2016 11:59:49

I don't know what I'm doing wrong.

The result I need is:

array (
  0 => 
  array (
    0 => '<td align="center">15/01/2016 00:22:16</td>
        <td align="center">',
    1 => '<td align="center">11/01/2015 00:22:16</td>
        <td align="center">',
  ),
  1 => 
  array (
    0 => '15/01/2016 00:22:16',
    1 => '11/01/2015 00:22:16',
  ),
)
Cœur

3 Answers

Here's a parser/regex approach:

$html = '<tr>
                            <td align="center">13.333.333-3</td>
                            <td align="center">asdf3</td>
                            <td align="center">15/01/2016 00:22:16</td>
                            <td align="center">$ 1531</td>
                        </tr>';
$thedoc = new DOMDocument();
$thedoc->loadHTML($html);
$cells = $thedoc->getElementsByTagName('td');
foreach($cells as $cell){
    if(preg_match('~^(\d{2}/\d{2}/\d{4})\h(\d{2}:\d{2}:\d{2})$~', $cell->nodeValue, $matches)) {
         echo 'Date:' . $matches[1] . ' Time:'. $matches[2];
    }
}

PHP Demo: https://eval.in/515935
Regex101 Demo: https://regex101.com/r/sT2hD9/1

This would also accept invalid dates and times as long as they are formatted correctly, e.g. 22/22/2222 25:61:62. Depending on your requirements you could tighten the pattern, or make parts of it (such as the seconds) optional. You could also capture the day, month, year, hours, minutes, and seconds in separate groups.
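As a rough sketch of that idea, the pattern below captures each part in its own group, makes the seconds optional, and rejects well-formed but impossible values. The function name parseDateTime is hypothetical and not part of the answer above; the range checks rely on PHP's built-in checkdate().

```php
<?php
// Hypothetical helper: parse "dd/mm/yyyy hh:mm[:ss]" and reject impossible values.
function parseDateTime(string $text): ?array
{
    // One capture group per part; the seconds group is optional.
    $pattern = '~^(\d{2})/(\d{2})/(\d{4})\h(\d{2}):(\d{2})(?::(\d{2}))?$~';
    if (!preg_match($pattern, trim($text), $m)) {
        return null; // not in the expected format
    }
    [$day, $month, $year, $hour, $minute] = array_map('intval', array_slice($m, 1, 5));
    // An unmatched trailing group is absent from $m, so default the seconds to 0.
    $second = isset($m[6]) ? (int) $m[6] : 0;
    if (!checkdate($month, $day, $year) || $hour > 23 || $minute > 59 || $second > 59) {
        return null; // formatted correctly, but not a real date or time
    }
    return compact('day', 'month', 'year', 'hour', 'minute', 'second');
}

print_r(parseDateTime('15/01/2016 00:22:16'));        // array of integer parts
var_dump(parseDateTime('22/22/2222 25:61:62'));       // NULL
```

Inside the foreach loop above, you could call this with $cell->nodeValue and skip any cell for which it returns null.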

chris85

It is considered better to parse HTML with a proper DOM parser than to use regular expressions on it, so I'll give that solution first:

1. With DOMDocument

Use DOMDocument in combination with DOMXPath for this.

Here is code that only gets the content of the third column, which contains date/times:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//td[3]');
$matches = array_map(function($td) {
    return $td->textContent;
}, iterator_to_array($elements));

This code runs an XPath query that finds the td elements that are the third td child of their parent (tr), and then maps the text content of each one into an array.

If the $html variable has this string:

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="facturas">
<tr>
    <td align="center">13.44.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">15/01/2016 00:22:16</td>
    <td align="center">$ 1531</td>
 </tr>
 <tr>
    <td align="center">13.333.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">16/01/2016 00:22:16</td>
    <td align="center">$ 1531</td>
 </tr>
 <tr>
    <td align="center">13.333.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">11/01/2015 00:22:16</td>
    <td align="center">$ 1531</td>
</tr>
</table>

Then $matches will be the following array:

array (
  '15/01/2016 00:22:16',
  '16/01/2016 00:22:16',
  '11/01/2015 00:22:16',
)

See the code run with output on eval.in.

Some alternative XPath queries:

If the $html could have other tables, you should limit the search to the table of interest, e.g. with id equal to facturas:

//*[@id="facturas"]//td[3]

To make sure each matched td has the align attribute set to "center":

//td[@align="center"]

To find elements that have a specific text, like "/2016":

//td[contains(., "/2016")]
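These predicates can also be combined in a single query. The following is only a sketch: the second table (id "other") and its contents are made up here to show the effect of scoping the search to facturas.

```php
<?php
// Sample input: the second table is hypothetical, added only to show scoping.
$html = '
<table id="facturas">
  <tr><td align="center">asdf3</td><td align="center">15/01/2016 00:22:16</td></tr>
  <tr><td align="center">asdf3</td><td align="center">16/01/2016 00:22:16</td></tr>
  <tr><td align="center">asdf3</td><td align="center">11/01/2015 00:22:16</td></tr>
</table>
<table id="other">
  <tr><td align="center">20/02/2016 10:00:00</td></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Only cells inside #facturas, centred, whose text contains "/2016".
$query = '//*[@id="facturas"]//td[@align="center"][contains(., "/2016")]';

$dates = [];
foreach ($xpath->query($query) as $td) {
    $dates[] = trim($td->textContent);
}

print_r($dates); // the two 2016 dates from the facturas table only
```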

2. With a Regular Expression

Although not advised, you could use a regular expression.

If you still want to go for this, then use this code:

preg_match_all("/<td[^>]*\>\s*(\d\d\/\d\d\/\d{4}\b[^<]*)<\/td\s*>/mis",
               $html, $matches);

This will match td elements that contain a value that starts with text in the format "99/99/9999" (9 can be any digit).

Now $matches will be:

array (
  0 => 
  array (
    0 => '<td align="center">15/01/2016 00:22:16</td>',
    1 => '<td align="center">16/01/2016 00:22:16</td>',
    2 => '<td align="center">11/01/2015 00:22:16</td>',
  ),
  1 => 
  array (
    0 => '15/01/2016 00:22:16',
    1 => '16/01/2016 00:22:16',
    2 => '11/01/2015 00:22:16',
  ),
)

See the code run with output on eval.in

But note that in general text in HTML can have entities like &gt; (can be solved with html_entity_decode), or td elements can have <br> or other tags inside them (can sometimes be solved with strip_tags), or tag attributes can have values that contain HTML, which could trick the regular expression. The same goes for script tags, which may have JavaScript that contains HTML strings in variables.

These are just examples. The list of things that can make such a regular expression go wrong is long. None of this is a problem when using the DOM parser, but with regular expressions it is nearly impossible to get it right for all possible cases.
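If you do stay with the regular expression, the clean-up mentioned above (strip_tags and html_entity_decode) might look like the sketch below. The $raw value is a hypothetical cell body, not taken from the question, and this only covers simple cases: a <br> between the date and the time, for instance, would be removed without leaving a space.

```php
<?php
// Hypothetical raw cell content containing an inline tag and an entity.
$raw = '<b>15/01/2016</b>&nbsp;00:22:16';

// Drop the tags first, then turn entities into real characters.
$text = html_entity_decode(strip_tags($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// &nbsp; decodes to U+00A0, not an ordinary space, so normalise it.
$text = str_replace("\u{00A0}", ' ', $text);

echo $text, PHP_EOL; // 15/01/2016 00:22:16
```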

Solution 1 is therefore the one to go for.

trincot
  • Yes, but it is not clear to me whether the OP only wants the third element's value. The `preg_match_all` statement provided in the question does not even return the date, only the first element's content. Probably an error in encoding the question. I think the intention was to get all four elements. – trincot Feb 09 '16 at 23:00

Have you found a solution yet? I'd like to help.

<?php

$html=<<<HEREDOC
  <tr>
    <td align="center">13.44.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">15/01/2016 00:22:16</td>
    <td align="center">$ 1531</td>
</tr>
<tr>
    <td align="center">13.333.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">16/01/2016 00:22:16</td>
    <td align="center">$ 1531</td>
</tr>
 <tr>
    <td align="center">13.333.333-3</td>
    <td align="center">asdf3</td>
    <td align="center">11/01/2015 00:22:16</td>
    <td align="center">$ 1531</td>
</tr>
HEREDOC;

if(preg_match_all('~<td\s+[^>]*>((?:\d+(?:\/\d+){2})\s+(?:\d+(?:\:\d+){2}))<\/td>~mi',$html,$matchall)){
    print_r($matchall);
}
?>

Output will be

Array
(
[0] => Array
    (
        [0] => <td align="center">15/01/2016 00:22:16</td>
        [1] => <td align="center">16/01/2016 00:22:16</td>
        [2] => <td align="center">11/01/2015 00:22:16</td>
    )

[1] => Array
    (
        [0] => 15/01/2016 00:22:16
        [1] => 16/01/2016 00:22:16
        [2] => 11/01/2015 00:22:16
    )

)
amachree tamunoemi