5

I have a system that is causing errors when users use a semicolon in a free format field. I have traced it down to a simple explode statement:

$array = explode( ";", $sql );

Because this line is in a subroutine that is called from all over the system I would like to replace this line with something that will split things properly, without breaking the rest of the system. I thought I was onto a winner with str_getcsv, but that isn't sophisticated enough either. Look at the following example

$sql = "BEGIN;INSERT INTO TABLE_A (a, b, c) VALUES('42', '12', '\'ab\'c; DEF');INSERT INTO TABLE_B (d, e, f) VALUES('42', '43', 'XY\'s Z ;uvw') ON DUPLICATE KEY UPDATE f='XY\'s Z ;uvw';COMMIT;";

$array = str_getcsv($sql, ";", "'");
foreach( $array as $value ) {
    echo $value . "<br><br>";
}

When I run this is outputs the following:

BEGIN

INSERT INTO TABLE_A (a, b, c) VALUES('42', '12', '\'ab\'c

DEF')

INSERT INTO TABLE_B (d, e, f) VALUES('42', '43', 'XY\'s Z

uvw') ON DUPLICATE KEY UPDATE f='XY\'s Z

uvw'

COMMIT

So it doesn't notice the semicolons are inside quotes. (As far as I can see the quoted strings from different places in the system are always in single quotes, but it is possible that at times they are double quotes, I am not sure about that.)

Can anyone tell me how to do this? I suspect I can do this with a very complicated regex, but this is over my head.

  • Hey there, following up on this. Did one of the answers solve it for you, or is the question still there? Please give us some feedback. :) – zx81 Jun 27 '14 at 00:08

1 Answers1

8

(*SKIP)(*FAIL) Magic

This live PHP demo shows you the output of the two options below (with or without the semi-colon).

This is what you need:

$splits = preg_split('~\([^)]*\)(*SKIP)(*F)|;~', $sql);

See demo to see that we are splitting on the right semi-colons.

Output:

[0] => BEGIN
[1] => INSERT INTO TABLE_A (a, b, c) VALUES('42', '12', '\'ab\'c; DEF')
[2] => INSERT INTO TABLE_B (d, e, f) VALUES('42', '43', 'XY\'s Z ;uvw')
[3] => COMMIT
[4] =>

The empty item #4 is the match on the other side of the final ;. The other option is to keep the semi-colons (see below).

Option 2: Keep the Semi-Colons

If you want to keep the semi-colons, go with this:

$splits = preg_split('~\([^)]*\)(*SKIP)(*F)|(?<=;)(?![ ]*$)~', $sql);

Output:

[0] => BEGIN;
[1] => INSERT INTO TABLE_A (a, b, c) VALUES('42', '12', '\'ab\'c; DEF');
[2] => INSERT INTO TABLE_B (d, e, f) VALUES('42', '43', 'XY\'s Z ;uvw');
[3] => COMMIT;

Explanation

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

In left side of the alternation |, the regex \([^)]*\) matches complete (parentheses) then deliberately fails, after which the engine skips to the next position in the string. The right side matches the ; word you want, and we know they are the right ones because they were not matched by the expression on the left. It is now safe to split on it.

In Option 2, where we keep the semi-colons, our match on the right matches a position, but no characters. That position is asserted by the lookbehind (?<=;), which asserts that a ; immediately precedes the position, and the negative lookahead (?![ ]*$), which asserts that what follows is not optional spaces then the end of the string (so we avoid a last empty match).

Sample Code

Please examine the live PHP demo.

Reference

Community
  • 1
  • 1
zx81
  • 38,175
  • 8
  • 76
  • 97
  • This is just what I am looking for. It works a treat. Thanks.Will need to watch your live php demo sometime soon. – user3777863 Jun 27 '14 at 12:20
  • Just came across an other case where this doesn't work, namely when an SQL statement does have an 'ON DUPLICATE' section, in which case the arguments are not in brackets. Have updated the question to reflect this. – user3777863 Jun 30 '14 at 00:47
  • OK, I worked it out, the links you provided prove very helpful. If I change the regex to '~\([^)]*\)(*SKIP)(*F)|\'(?:\\\'|[^\'])*\'(*SKIP)(*F)|;~' it works. Maybe you can update it in your answer for other people looking at this answer. – user3777863 Jun 30 '14 at 01:02
  • OK,, hopefully the last update. The above regex works in regex101.com, but in actual php code it should have two extra slashes: '~([^)]*)(*SKIP)(*F)|\'(?:\\\\\'|[^\'])*\'(*SKIP)(*F)|;~' – user3777863 Jun 30 '14 at 01:22
  • 1
    For anyone else reading this who is not familiar with the `(*SKIP)(*FAIL)` dark magic, you can read about it in the [PCRE documentation](http://www.pcre.org/pcre.txt). – elixenide Jun 30 '14 at 01:27
  • 1
    @EdCottrell Thanks, Ed! And even more directly, see [Special Backtracking Control Verbs](http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs) in the perl doc. Added to the references, great idea, thank you. :) – zx81 Jun 30 '14 at 03:12
  • @user3777863 Normally in PHP code, you're going to need way **fewer** backslashes. Regex101 forces escaping on `/` because they use it as a delimiter (lame) :) You can see on the actual php demo that the amount of escaping is minimal, right? I'm not able to look at backslash soup (can't focus) but if you give me your raw regex I'll tell you if anything needs escaping. :) – zx81 Jun 30 '14 at 03:17
  • My actual test script looks like this, and without the five slashes (which I don't like either) it doesn't work for me: $sql = "BEGIN;INSERT INTO TABLE_A (a, b, c) VALUES('42', '12', '\'ab\'c; DEF') ON DUPLICATE KEY UPDATE c='\'ab\'c; DEF';INSERT INTO TABLE_B (d, e, f) VALUES('42', '43', 'XY\'s Z ;uvw') ON DUPLICATE KEY UPDATE f='XY\'s Z ;uvw';COMMIT;"; $array = preg_split('~\([^)]*\)(*SKIP)(*F)|\'(?:\\\\\'|[^\'])*\'(*SKIP)(*F)|;~', $sql); echo $sql . "

    "; foreach( $array as $value ) { echo $value . "

    "; }
    – user3777863 Jul 02 '14 at 03:39
  • @zx81 What about statements that include delimiters? I've been trying to fork your regex to account for that, but I'm having difficulty implementing it with my novice experience with variation (*SKIP)(*F) expressions. DELIMITER $$ SELECT NOW()$$ DELIMITER ; -- etc. – Erutan409 Jul 20 '15 at 16:47
  • This works most of the time, but fails with SQL comments containing either `;` or `'`, which is present in e.g. SQL output from pgAdmin - for the most part, stripping lines starting with `--` before splitting will help, but things like comments on the same line, or block comments, will fail. – mindplay.dk May 26 '16 at 08:33
  • the first regex from the answer fails on: "create table test ( id bigint auto_increment primary key, test varchar(255) null comment 'NULL = all; otherwise use test names' )" - splits on a semicolon inside comment https://ideone.com/bb4alb – hawk Aug 18 '17 at 12:00