
I'm trying to parse some data from a .X12 file using regular expressions in PHP. The pattern is the capital letters FS followed by exactly 13 numeric characters.

Here is an example of some text: PROCUREMENT333RFQ3PO100011EAFS8340015381823PKGFSHALL

I need to extract 'FS8340015381823' and other variations of the 13 numeric characters from other files.

Here is the code that is not working for me:

$regex = '/FS[0-9]{13}?/';
preg_match( $regex, $x12, $matches );
var_dump( $matches );

I've also tried these regex patterns:

$regex = '/FS8340015381823/';
$regex = '/FS\d{13}?/';

All of these regexes work fine if I assign the example string above to the $x12 variable before calling preg_match(), but they don't work on the raw file contents. When I echo the .X12 file contents to the screen, I see exactly the string I used as the example above. If I use the regex /FS/, it finds the 'FS'.

This regex works on the raw file data, but it returns matches that aren't just numeric characters after the 'FS':

$regex = '/FS.{13}?/';

Could there be strange characters that the terminal on my machine is not displaying? I'm running Linux CentOS on an Amazon EC2.

T. Brian Jones
  • remove the ? from your regex ... this makes the match optional. – Orangepill Aug 14 '13 at 22:38
  • I would check it with a hex editor; that way you can be sure not to overlook any stray characters. BTW, what's the result of `/FS.{13}/s`? – HamZa Aug 14 '13 at 22:38
  • _"Could there be strange characters that the terminal on my machine is not displaying?"_ => open in your favorite HEX editor. – Wrikken Aug 14 '13 at 22:47
  • @HamZa - `/FS.{13}/s` actually works on the raw file data, but it finds a bunch of other matches that aren't just 13 numeric characters. I can work around this by then processing that outside the regex, but that seems silly. – T. Brian Jones Aug 14 '13 at 22:54
  • @T.BrianJones My point was if there were digits or not in the output. If you can provide us with a small dump we could maybe help you further. But generally you would open a hex editor and check the exact characters, from there on you can build a reliable regex. – HamZa Aug 14 '13 at 22:56
  • @HamZa - There are digits in the output. Honestly, I'm not sure how to provide a dump. I'm loading these files off S3 and when I try to write the files to disk using `file_put_contents()` they become binary files. I'm in the process of setting up a hex editor to test with. – T. Brian Jones Aug 14 '13 at 22:57
  • @T.BrianJones `echo base64_encode(file_get_contents('path/to'));` and paste the result on http://pastebin.com, provided there is no sensitive information, of course ... – HamZa Aug 14 '13 at 23:03
  • @HamZa - Great point, I actually can't paste a full file ( sensitive data ). I've converted the file to hex and found a `1d` character between the 'FS' and the digits ... which seems to be a 'Group Separator' character. What do I do with those? How do I remove them / deal with them? – T. Brian Jones Aug 14 '13 at 23:15
  • @HamZa, Why were you saying `[^\W_]`? `\w{2}` was for matching the two preceding letters. You can also just use `\D` for matching non-digits. – hwnd Aug 14 '13 at 23:16
  • @hwnd `\w` will match letters, digits and underscore. Note that it will also match letters like `éë` and much more. Now `\W` (note the uppercase) behaves the opposite way of `\w`. Here's the trick: if we put that in a negated character class, it will behave as `\w`: `[^\W]`. Now let's exclude the underscore: `[^\W_]`. `\D` will match anything that's not a digit, which means even dashes, periods, line breaks etc., while `[^\W_]` doesn't. It was just a trick I wanted to show. – HamZa Aug 14 '13 at 23:22
  • @HamZa, thanks for the feedback. – hwnd Aug 14 '13 at 23:23
  • @T.BrianJones Try something like `/FS\x1d[0-9]{13}/`. Note that you don't need the `?`. A question mark after a quantifier means "make it ungreedy", i.e. lazy. Quantifiers are greedy by default. But wait: you are specifying "match exactly 13 times", which means it will act the same whether lazy or greedy. So we can just remove it. See this [answer](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532) for more information. – HamZa Aug 14 '13 at 23:26
  • @hwnd you're welcome :) – HamZa Aug 14 '13 at 23:42
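
The point about the trailing `?` can be checked with a minimal sketch. The `$sample` string below is a hypothetical stand-in for the file data (an assumption, not the real file): "FS", then a Group Separator byte (`0x1D`), then the 13 digits from the question.

```php
<?php
// Hypothetical sample: "FS", a Group Separator (0x1D), then 13 digits.
$sample = "EAFS\x1d8340015381823PKG";

// The same pattern with and without "?" after the fixed quantifier {13}.
preg_match('/FS\x1d[0-9]{13}/', $sample, $greedy);
preg_match('/FS\x1d[0-9]{13}?/', $sample, $lazy);

// A fixed repeat count leaves nothing to be lazy or greedy about,
// so both calls return the same match.
var_dump($greedy[0] === $lazy[0]); // bool(true)
```

Laziness only changes the result for variable-length quantifiers such as `*`, `+`, or `{1,13}`.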

1 Answer

Thanks to @HamZa for the help and to the OP for breaking down the data.

You can use /FS\x1d[0-9]{13}/ or /FS\x1d\d{13}/

If your data may contain other control characters in that position, you can use a character class covering the whole range:

/FS[\x00-\x1f]\d{13}/ 
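
As a rough sketch of how the character-class pattern might be used end to end: the `$x12` string below is a hypothetical stand-in for the raw file contents (the question's example with a `0x1D` byte inserted after the first "FS"), and the capture group and `$codes` array are illustrative additions, not part of the patterns above.

```php
<?php
// Hypothetical stand-in for the raw file contents: a Group Separator
// (0x1D) sits between "FS" and the 13 digits.
$x12 = "PROCUREMENT333RFQ3PO100011EAFS\x1d8340015381823PKGFSHALL";

// [\x00-\x1f] accepts any single ASCII control character after "FS";
// the parentheses capture only the 13 digits.
$regex = '/FS[\x00-\x1f](\d{13})/';

$codes = [];
if (preg_match_all($regex, $x12, $matches)) {
    foreach ($matches[1] as $digits) {
        // Rebuild the value without the invisible separator.
        $codes[] = 'FS' . $digits;
    }
}

print_r($codes); // one entry: FS8340015381823
```

Capturing the digits separately avoids carrying the control character into the extracted value. The trailing "FSHALL" is not matched because no control character follows that "FS".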
hwnd
  • To expand on this answer, the problem with the original file contents was that they contained a character that was invisible when output to the screen in a terminal or viewed in a standard text editor. I had to convert the file contents to hex ( `bin2hex()` in PHP ) to see this character. It was a Group Separator character, `1d` in hex and `29` in decimal. This GS character sits between the 'FS' and the digits I was searching for. – T. Brian Jones Aug 15 '13 at 01:12
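
One possible way to surface such invisible bytes, sketched under assumptions: `$x12` is a hypothetical short stand-in for the loaded file contents, and the `<1d>`-style marker format is only an illustrative choice.

```php
<?php
// Hypothetical stand-in for the contents loaded from the raw .X12 file.
$x12 = "EAFS\x1d8340015381823PKG";

// Full hex dump: every byte becomes two hex digits.
echo bin2hex($x12), "\n";

// Alternatively, make control bytes visible in place,
// so 0x1D shows up as "<1d>".
$visible = preg_replace_callback(
    '/[\x00-\x1f\x7f]/',
    function ($m) {
        return '<' . bin2hex($m[0]) . '>';
    },
    $x12
);

echo $visible, "\n"; // EAFS<1d>8340015381823PKG
```

The second form keeps the readable text intact, which makes it easier to see where in the record the control character sits.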