0

I've used Regex in Mysql to a limited extent to find various string patterns in MySql records (e.g Init Cap, All Caps). I'm trying to find a more complex pattern which also includes optional strings. I know if I simply searched on all possible patterns it would work but trying to see if RegEx is more powerful than that and would help me to at least reduce the # of patterns I need to search. Essentially the pattern is Digit Word Suffix.

1) The beginning of the pattern is always a # but anywhere from 1-5 digits. I'd prefer not to have to search on [0-9], then [0-9][0-9] etc. In other words some way to search on any digit string of 1 to 9999 (without leading zeros)

2) Between the leading digit and the suffix there can be up to three words. Again I'd prefer to not have to have a pattern match with 1,2 an 3 words. Clearly if I had to do this with each version of the digits I'd already have 4*3 patterns to search which seems inefficient

3) There is an optional modifier word that can be either after the middle word or after the suffix. As per above I'd prefer that to be indicated as such so that in essence I can simply make a regex that encompasses the following logic:

[Up to 4Digits] [Optional Modifier from Known List of Modifier Words] [One to Three Unknown Words] [Suffix from Known List of Suffix Words] [Optional Modifier from Known List of Modifier Words]

Is RegEx up to this task even partially, if so any assistance in starting this would be appreciated.

Mark H
  • 249
  • 2
  • 10

2 Answers2

2

As MySQL uses Henry Spencer's Regex implementation, commonly used shorthands like \s, \w... are possibly not available, instead one can use posix-like character classes (MySQL regex manual).

desired pattern:

[Up to 4Digits] [Optional Modifier from Known List of Modifier Words] [One to Three Unknown Words] [Suffix from Known List of Suffix Words] [Optional Modifier from Known List of Modifier Words]

What's unclear to me, is: What is a word? What characters besides alphabetic characters may it include? Considering [[:alpha:]] as word-characters or [[:alnum:]]? Could also expand a posix class for certain characters, that should be word-characters e.g. [[:alpha:]\'-].


Also things can become different for different collations like utf8_general_ci, utf8_bin e.g.:

SELECT 'A' REGEXP '[a-z]' => 1

SELECT BINARY 'A' REGEXP '[a-z]' => 0


pattern

Assuming [[:alpha:]] as word-characters and [[:space:]] as separator/boundary:

  • 1.) [1-9][0-9]{0,3} ...should match up to 4 digits: 1 to 9999
  • 2.) ([[:space:]](mod1a|mod1b|mod1c))? ...optional Modifier word...*
  • 3.) ([[:space:]][[:alpha:]]+){1,3} ...one to Three Unknown Words
  • 4.) ([[:space:]](sfxa|sfxb|sfxc)) ...suffix from Known List...
  • 5.) ([[:space:]](mod2a|mod2b|mod2c))? ...optional Modifier word *

The questionmark after the ) makes the parenthesized group optional. Inside there is another parenthesized group with an alternation of different words. See the SO Regex FAQ for further details.

There are also word-boundaries (zero-width) available in MySQL regex: [[:<:]]word[[:>:]]


Putting it all togehter and between ^ start and $ end anchor, the pattern could look like:

SELECT '9999 mod1a aa ab ac sfxa mod2a'
REGEXP '^[1-9][0-9]{0,3}([[:space:]](mod1a|mod1b|mod1c))?([[:space:]][[:alpha:]]+){1,3}([[:space:]](sfxa|sfxb|sfxc))([[:space:]](mod2a|mod2b|mod2c))?$'

Test and modify it to your needs. Hope this helps, happy regexing!

Community
  • 1
  • 1
Jonny 5
  • 11,051
  • 2
  • 20
  • 42
  • 1
    +1 for the notes on shorthands, its been a while since I've been in MySQL. – Sam May 02 '14 at 17:19
  • Great suggestions in both cases thanks. To answer your "What is a word?" it is literally any combination of 2 or more letters (not #s), case not an issue. The important identifiers are the starting digit(s) and the suffix which can be only one of 10-20 strings (and the optional suffix modifier which can be one of 10). Anytihng can be inbetween really but realistically this will be three words max. Thus Digit (KnownModifer) 1-3Words Suffix (KnownModifier). I'm willing to lose the (rare) 4Words since that will start to get me string matches that are not actual pattern matches. – Mark H May 02 '14 at 19:37
  • Not to beat to death but more accurately this is: [1-4NumberDigit] [Optional: Any 1 of 10 Known Modifer Words] [Any AlphaWord of 2 letter or more case insensitive] [Optional Any AlphaWord of 2 letter or more case insensitive] [Optional: Any AlphaWord of 2 letter or more case insensitive] [Any 1 of 10 Known Suffix Words] [Optional: Any 1 of 10 Known Modifier Words] – Mark H May 02 '14 at 20:15
  • Hmm ok I tried and modified. The issue seems to be (after just running yours as is) the return is 1 or 0. My goal in fact it to search in a text block and extract that pattern, sorry if that was not clear. – Mark H May 02 '14 at 20:41
  • Ok further testing using regex.com; this pattern gets matches only if I include the optional words for some reason. – Mark H May 02 '14 at 21:46
  • Ok almost got it; only one remaining issues; The first optional string ([[:space:]](mod1a|mod1b|mod1c))? does not act optionally whilst the last one does – Mark H May 02 '14 at 23:00
  • Ok done and done for anyone following: the first ? didn't work because it needed an additional open and close parenthese i.e ^[1-9][0-9]{0,3}(([[:space:]](mod1a|mod1b|mod1c)))? T – Mark H May 02 '14 at 23:33
  • Hey @MarkH! Just read the comments, seems, you got it going. Else could you provide some sample input in your question, that it should match, but doesn't and your final pattern. `SELECT '1 mod1b' REGEXP '^[1-9][0-9]{0,3}([[:space:]](mod1a|mod1b|mod1c))?$'` works for me, with and without `mod1b`. – Jonny 5 May 03 '14 at 13:32
1

Something like this may get you started:

[1-9]\d{0,4}\s+   # 1-9 (no leading 0), followed by up to 4 more digits and whitespace
(\w+\s+)?         # Optional modifier (replace \w+ with modifiers) followed by whitespace
(\w+\s+){1,3}     # 1-3 occurrences unknown words (you may want to elaborate on \w+) followed by whitespace
\w+               # Required suffix (replace \w+ with suffixes)
(\s+\w+)?         # Optional modifier (replace \w+ with modifiers) preceeded by whitespace

See this full example for how you may implement your lists of modifiers/suffixes. In this example, our modifiers are ['optional', 'etc'] and our suffixes are ['suffix', 'end'].

This should get you in the right direction, if you need more guidance let me know in the comments.

Sam
  • 18,756
  • 2
  • 40
  • 65
  • Thanks I tried the modification from Johhny5 abovee and then used your regex.com link to test. The interesting thing is the pattern matching worked perfectly at regex.com but fails in mysql (returns 0s). – Mark H May 02 '14 at 23:53