2

I have searched high and low and am really struggling to find the appropriate REGEX that would help me retrieve what I want. Assume I have the following query string:

SELECT col, col2 AS c2, col3, col * col2 calc FROM...

I want a REGEX that will pull out everything between commas where there is a space or an "AS" that proceeds a column alias. It should also exclude the SELECT and FROM. With the example above, I would want the following matches:

col2 AS c2
col * col2 calc

Essentially, this should remove col and col3 because they don't have aliases. I have tried variations of the following but have been unsuccessful (Note: $colname contains the alias):

/(SELECT|,)([^,]*{$colname}[^,]*)(FROM|,)/

Currently this only returns:

, col2 AS c2,
col2 AS c2
, col2 AS c2,
col2 AS c2

I feel I am close but I need some guidance. Any suggestions are appreciated.

Dustin
  • 113
  • 1
  • 9

1 Answers1

2

It is next to impossible to handle SQL string parsing with one regex, you should look for a more generic solution like PHP-SQL-Parser.

The regex I have come up with is an ad-hoc work around that can be further precised, adjusted, etc. to fit the needs of a specific project:

(?:\G(?!\A),|SELECT)(?:\s+[^\s,]+(?:\([^()]*\))?,)*\s*\K[^\s,]+\s*(?:(?![()',])[[:punct:]]|AS)\s*(?:(?!FROM)[^,])*(?!\w)

See the regex demo.

Here is what it does:

  • (?:\G(?!\A),|SELECT) - the end of the previous successful match and a comma after (\G(?!\A),) or (|) a literal SELECT substring
  • (?:\s+[^\s,]+(?:\([^()]*\))?,)* - zero or more sequences ((?:...)*) of:
    • \s+ - 1+ whitespaces
    • [^\s,]+ - 1+ chars other than whitespace and a comma
    • (?:\([^()]*\))? - an optional substring like (...) (a (, then 0+ chars other than ( and ), and then a ))
    • , - a comma
  • \s* - 0+ whitespace chars
  • \K - a match reset operator discarding the text matched so far (so, SELECT abc, Contains(...), will get omitted from the match, but will be required for the match)
  • [^\s,]+ - 1 or more chars other than whitespace and ,
  • \s* - 0+ whitespace chars
  • (?:(?![()',])[[:punct:]]|AS) - either
    • (?![()',])[[:punct:]] - any punctuation or symbol char but a (, ), ' and ,
    • | - or
    • AS - a literal AS substring
  • \s* - 0+ whitespaces
  • (?:(?!FROM)[^,])* - zero or more chars other than a comma not starting a FROM char sequence (it is a tempered greedy token)
  • (?!\w) - not followed with a word char.
Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397