
I'm looking to find all fee codes in a page. The codes are 5 digits, with an optional single letter at the beginning. I have this currently, which is working great.

preg_match_all("/\b([a-zA-Z])?\d{5}\b/", $content, $matches);

My problem is I need to exclude any that occur within the 'title' attribute of a link.

<a href="#" title="Sample Fee – also see B11023">G14015</a>

I want to match on the G14015, but not B11023.

Any suggestions? Much appreciated.

Andy Lester
  • Do you want to exclude all fee codes that occur (as an attribute) within a tag? That would make it easier, especially if they occur as in your example - right between two tags. – SQB Apr 09 '14 at 20:09
  • The codes I want to match will either be as the G14015 is, the text of a link or just as plain text in the body of the page copy. So, as above or "blah blah blah 12233 blah blah G18828 blah blah". Excluding everything within "" would work fine. – Sandbox Wizard Apr 09 '14 at 20:12
  • The problem is that PHP does not allow lookbehind of arbitrary length. In other words, when you've found a fee code, you can't look back to see if you're _somewhere_ within an attribute. – SQB Apr 09 '14 at 20:14
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Apr 09 '14 at 21:07
  • @AndyLester I don’t disagree with your advice; [it’s my own as well](http://stackoverflow.com/q/4231382/471272). It’s good to remember that while all things are possible, not all are expedient — and using patterns to match generic rather than specific HTML is one of those. – tchrist Jun 08 '14 at 20:17

2 Answers


Based on your comments, which clarify that no fee code inside a tag should be matched, I'd suggest a two-pass solution. First, replace every tag with a single space. Then search the result for fee codes.

$content = preg_replace("/<[^>]+>/", " ", $content);
preg_match_all("/\b[A-Za-z]?\d{5}\b/", $content, $matches);

This assumes no stray < or > is present.
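As a rough, self-contained illustration of the two passes, here is a sketch run against the sample link from the question plus some plain-text codes. The input string and the `$text` variable are made up for the example; the letter is treated as optional, as the question describes.

<?php
// Hypothetical input: the question's sample link followed by plain-text codes.
$content = '<a href="#" title="Sample Fee – also see B11023">G14015</a> blah 12233 blah G18828';

// Pass 1: replace every tag (attributes included) with a single space.
$text = preg_replace('/<[^>]+>/', ' ', $content);

// Pass 2: find 5-digit codes with an optional leading letter.
preg_match_all('/\b[A-Za-z]?\d{5}\b/', $text, $matches);

// $matches[0] holds G14015, 12233 and G18828; B11023 was removed with the tag.
print_r($matches[0]);

Replacing tags with a space rather than an empty string keeps codes in adjacent elements from running together.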


Of course, the usual warning applies: regex is not a reliable way to parse HTML or XML.
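For comparison, a parser-based version might look roughly like the sketch below. It is not part of the approach above: it assumes PHP's bundled DOM extension is available, uses a made-up input string, and the `$codes` variable is only illustrative. Attribute values are never visited because only text nodes are queried.

<?php
// Hypothetical input fragment.
$content = '<p><a href="#" title="Sample Fee - also see B11023">G14015</a> and 12233</p>';

// Collect parser warnings internally instead of emitting them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

// Visit text nodes only; attributes such as title are not text nodes.
$xpath = new DOMXPath($doc);
$codes = [];
foreach ($xpath->query('//text()') as $node) {
    if (preg_match_all('/\b[A-Za-z]?\d{5}\b/', $node->nodeValue, $m)) {
        $codes = array_merge($codes, $m[0]);
    }
}

// $codes holds G14015 and 12233.
print_r($codes);

Note that `//text()` also visits text inside elements like script or style, so a real page may need a narrower XPath expression.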

Community
SQB
  • Thanks @SQB. I'm getting an error though `preg_replace(): Unknown modifier ']'` on the first line. – Sandbox Wizard Apr 09 '14 at 20:58
  • Don't forget the / on the sides of ]+> – Arbitur Apr 09 '14 at 21:03
  • @Arbitur Thanks for catching that, I'll edit it in. Feel free to edit things like that in yourself, in the future — you're not just allowed, but actively encouraged to. – SQB Apr 11 '14 at 06:05
  • @SQB Yeah, I wanted to edit it myself, but I had to add 10 more characters as well and didn't know what to add or change :) – Arbitur Apr 11 '14 at 08:05

PHP has (*SKIP)(*FAIL) Magic

Resurrecting this question because it had a simple solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

With all the warnings about using regex to parse html, here is a simple way to do it.

We can solve it with one single and simple regex:

(?i)<[^>]+(*SKIP)(*F)|[a-z]?\d{5}

See demo.

The left side of the alternation | matches the contents of each <tag>, then deliberately fails; (*SKIP) makes the engine resume matching at the position where the tag match ended, so nothing inside a tag is tried again. The right side matches the pattern you want, and we know the matches are the right ones because they were not consumed by the expression on the left.

Sample Code

$regex = '~(?i)<[^>]+(*SKIP)(*F)|[a-z]?\d{5}~';
preg_match_all($regex, $yourstring, $matches);
print_r($matches[0]);
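To see the technique on concrete input, here is a small sketch using the sample link from the question plus some plain-text codes. The `$html` string is made up for the example, and the `\b` word boundaries are an addition carried over from the question's original pattern, so that five digits inside a longer number are not picked up.

<?php
// Hypothetical input: the question's sample link followed by plain-text codes.
$html = '<a href="#" title="Sample Fee – also see B11023">G14015</a> blah 12233 blah G18828';

// Left branch: consume a tag's contents, then (*SKIP)(*F) rejects the attempt
// and restarts the search after the consumed text.
// Right branch: the fee code itself, with \b added here as an assumption.
$regex = '~(?i)<[^>]+(*SKIP)(*F)|\b[a-z]?\d{5}\b~';
preg_match_all($regex, $html, $matches);

// $matches[0] holds G14015, 12233 and G18828; B11023 sits inside the tag and is skipped.
print_r($matches[0]);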


Community
zx81