Regex creation based upon input

Question

I have a web application, written in PHP with JavaScript and jQuery, that will be used as my company's Inventory Management System (IMS). What I would like to do is create a regular expression based on a value entered by the user.

The idea behind this is that most manufacturers' serial number schemas (the number of characters and the mixture of alphabetic and numeric values) are unique to a certain part. So when a part is added to the IMS and the first serial number is scanned into the system, I would like a regex to be built and saved to a database table corresponding to that part type. Whenever a serial number is scanned in the future, the part type should be auto-selected if the serial number matches the schema for that manufacturer. I understand this methodology may not always narrow things down to a single part, so I could instead return a list of parts that match the schema, saving the user from looking it up in the catalog.

The basis of my question is: what is the best starting point for writing a function that can decipher a value given by a user and create a regex from it? I'm not requesting a full function, just a starting point for how to look at my situation and goal so I can understand where to begin. I've scratched my head long enough and started writing functions numerous times, only to delete the entire block knowing I was headed for disaster.

Anything in code is possible - is this feasible?

EDIT - ADDED SAMPLE VALUES

DVD-RW (Optical Drives)

1613518L121
1613509L121
1613519L121

VGA Output Cards

0324311071068
0324311071134

COM Expansion Cards

608131234
608131237

Hard Drives

WMAYUJ753738
WMAYUJ072099
WMAYUJ683739
WMAYUJ844900

As you can see, some values are numeric only, with a certain number of characters. Others have alpha characters at the beginning followed by a series of numbers. Others may have alpha and numeric characters interspersed with each other. In almost every case, a simple rule based on length and alpha/numeric positions will be enough to identify a single part type in our list of goods. However, in cases where more than one expression matches a value, I can simply have the application show a list of the two or more products that match the regex and prompt the user to select the proper part. Overall, this will save time and reduce mistakes when selecting a product type in the WMS database.

Thanks for the comments. I understand I'm not asking a question that has one answer to it. I'm looking for a starting point on how to best step through the string and spit out a corresponding Regex statement that would match the value.

I don't think you can do it. You can't make a rule based on one example. That's my opinion, anyway. — Pete, May 22 '12 at 13:43
Pete, in your opinion, how many samples would one need in order to create a valid expression? Or are you suggesting that the entire idea itself is not valid? — Jeff, May 22 '12 at 13:47
Ah. That's a different thing altogether. It would be a bit like cracking codes, wouldn't it? The danger, as I see it, is that you make a rule based on X cases and then chuck back a load of data because the rule is flawed in some way. — Pete, May 22 '12 at 13:49
But don't let me stop you trying - it could be a fascinating project. Do you have some examples? — Pete, May 22 '12 at 13:51
Maybe you could process in steps? Also, if you post some sample values, I'm pretty sure you would get more replies. BTW, I would first look for special characters that must be escaped in a regexp (like the dot [.]) and split the value on those, or on characters that can be seen as separators, like '-' or '_' (think of ISBN numbers). Then match character classes such as [0-9]+, [A-Z]+, or [a-z]+ and extract their lengths (min, max), so that you could have custom rules like 'myLettersWithLengthX' + (mySeparators('-')) + myNumbersLengthBetween(m,n)... — user1340802, May 22 '12 at 13:57
Maybe you could also look at bioinformatics algorithms dealing with the [Longest common subsequence problem](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem) and other multiple-alignment algorithms. You will also need a convergence criterion to decide when to stop searching/aligning. Another way to view your question is as building a mask from samples. But yes, you will almost certainly need many samples for each of your manufacturers' numbers, and you should expect false matches (how will you handle these?). Best. — user1340802, May 22 '12 at 13:57
Lastly, you could find some ideas from the NLP community, like [Learning Information Extraction Patterns from Examples](http://www.cise.ufl.edu/~cgrant/projects/public/morpheus/files/learning_ir_patterns_from_examples.pdf), but here again, it is like having a tank for... — user1340802, May 22 '12 at 14:06
This is a nice idea, but it is the complete opposite of what regular expressions aim at. A regular expression defines a __language__ (a set of strings), but here the __language__ defines a regular expression. That's just the same as asking: given 5/50/5000 English words, can you come up with English grammar? The only way I came up with is building a huge alternation and optimizing it (building a prefix tree). — madfriend, Jun 16 '12 at 22:44

score 2 · Accepted Answer · answered May 22 '12 at 14:11

As @Pete says, I think you have set yourself too ambitious a goal. Some thoughts, perhaps overly generalized from your specific needs.

I take it that you want to scan a serial number like 1-56592-487-8 and infer that the regular expression /\d-\d{5}-\d{3}-\d/ matches parts of this type from a given manufacturer. (This happens to be the ISBN-10 for my copy of "Java in a Nutshell." ISBNs are not serial numbers, but work with me.) But you can't infer from a handful of examples what pattern the manufacturer uses. Maybe the first character position is a hex digit (0-F). Maybe the last character is a checksum that can be a digit or X (like ISBNs). Maybe there is a suffix, not always present, that denotes the plant. So you will find yourself building up many patterns for the same manufacturer/part type as new instances of the part come in.
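To make the overfitting risk concrete, here is a minimal Python 3 sketch. It uses the pattern and the ISBN from the paragraph above; the second value, ending in X, is a made-up serial that only illustrates the "check character" case and is not taken from real data.

```python
import re

# Pattern inferred from the single example in the text.
inferred = re.compile(r"\d-\d{5}-\d{3}-\d")

seen = "1-56592-487-8"    # the example the pattern was inferred from
unseen = "1-56592-487-X"  # hypothetical: same scheme, but the check character is X


def matches(serial):
    """Return True if the whole serial fits the inferred pattern."""
    return inferred.fullmatch(serial) is not None


print(matches(seen))    # the original example fits
print(matches(unseen))  # the X variant is rejected by the too-narrow pattern
```

A pattern inferred from one sample accepts that sample, but it may reject perfectly valid serials from the same manufacturer.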

You will also have the reverse problem. A maker of widgets uses the regex /[A-Z]{3}\d{7}/, and a maker of sonic screwdrivers uses the same pattern.
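One way to live with this collision is to return every part type whose stored pattern matches, and let the user choose. The following is only a sketch in Python 3: the `PATTERNS` table, the `candidate_parts` name, the `\d{9}` pattern, and the sample serial `ABC1234567` are all assumptions made for illustration.

```python
import re

# Hypothetical catalogue: part type -> stored pattern (illustrative values only).
PATTERNS = {
    "widget": r"[A-Z]{3}\d{7}",
    "sonic screwdriver": r"[A-Z]{3}\d{7}",
    "COM expansion card": r"\d{9}",
}


def candidate_parts(serial, patterns=PATTERNS):
    """Return, sorted by name, every part type whose pattern matches the whole serial."""
    return sorted(
        part for part, pattern in patterns.items()
        if re.fullmatch(pattern, serial)
    )


print(candidate_parts("ABC1234567"))  # both the widget and the sonic screwdriver match
print(candidate_parts("608131234"))   # only one candidate
```

An empty list means no known pattern fits, which is the point where a new pattern would have to be created or an existing one widened.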

That said, about the best you can do is something like this:

for each character in the scanned serial number
    if it is a capital letter
        add [A-Z] to the regular expression
    else if it is a digit
        add \d to the regular expression
    else
        add the character itself to the regular expression, escaped as necessary
end for
collapse multiple occurrences with the {m,n} interval quantifier
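
The pseudocode above could be sketched in Python 3 roughly as follows. The function names `char_class` and `pattern_from_serial` are made up for this example, and it assumes that runs are collapsed to an exact count `{n}` rather than a range; treat it as one possible reading of the steps, not a definitive implementation.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a regex token: [A-Z], \\d, or the escaped literal."""
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "0" <= ch <= "9":
        return r"\d"
    return re.escape(ch)


def pattern_from_serial(serial):
    """Build a pattern from one serial, collapsing runs of the same token into {n}."""
    parts = []
    for token, run in groupby(char_class(ch) for ch in serial):
        count = len(list(run))
        parts.append(token if count == 1 else "%s{%d}" % (token, count))
    return "".join(parts)


print(pattern_from_serial("WMAYUJ753738"))  # [A-Z]{6}\d{6}
print(pattern_from_serial("1613518L121"))   # \d{7}[A-Z]\d{3}
```

The generated pattern should be applied with `re.fullmatch` so that a longer serial is not accepted merely because its prefix fits.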

The rules for Vehicle Identification Numbers may also be inspiring. Think about how you would infer the rules for VINs, given a handful of examples.

Your response is valid and I understand the point you are making. ISBN, while a good example, does not fit the model, as ISBNs have the same format regardless of the publisher. The schema by which a manufacturer creates a unique serial number for a part is not an accepted standard. Regardless of whether one character in the serial number is a checksum or not, it is still a character that can be verified through regex. Your code sample makes sense and is a good starting point. Maybe that is as complicated as this routine needs to be, or maybe there's something that can make it more robust. — Jeff, May 22 '12 at 14:25

score 0 · Answer 2 · edited May 23 '17 at 10:26

EDIT: sorry, my sample code is buggy. You need this kind of algorithm as a first step on the parts that you will guess: longest substring or this

You will need to add iteration and some masking, as explained above and by David. Also, in the sample below, the "L121" for DVD-RW is not guessed (as I required that the string must start with 'common'). So you will need to find all the common consecutive subsequences and decide which ones are relevant (probably with some kind of gain-maximization function).

Using long_substr from the second link:

>>> for x in d:
    for y in d:
        if x == y: continue
        # longest common contiguous substring of the pair
        common = long_substr([x, y])
        length = len(common)
        # keep only pairs that share it as a prefix
        if x.startswith(common) and y.startswith(common):
            print("\t".join((x, y, str(length), common)))

which produces:

0324311071068   0324311071134   10  0324311071
0324311071134   0324311071068   10  0324311071
1613519L121 1613518L121 6   161351
1613519L121 1613509L121 5   16135
WMAYUJ844900    WMAYUJ753738    6   WMAYUJ
WMAYUJ844900    WMAYUJ072099    6   WMAYUJ
WMAYUJ844900    WMAYUJ683739    6   WMAYUJ
WMAYUJ753738    WMAYUJ844900    6   WMAYUJ
WMAYUJ753738    WMAYUJ072099    6   WMAYUJ
WMAYUJ753738    WMAYUJ683739    6   WMAYUJ
1613518L121 1613519L121 6   161351
1613518L121 1613509L121 5   16135
WMAYUJ072099    WMAYUJ844900    6   WMAYUJ
WMAYUJ072099    WMAYUJ753738    6   WMAYUJ
WMAYUJ072099    WMAYUJ683739    6   WMAYUJ
WMAYUJ683739    WMAYUJ844900    6   WMAYUJ
WMAYUJ683739    WMAYUJ753738    6   WMAYUJ
WMAYUJ683739    WMAYUJ072099    6   WMAYUJ
608131237   608131234   8   60813123
1613509L121 1613519L121 5   16135
1613509L121 1613518L121 5   16135
608131234   608131237   8   60813123
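
The `long_substr` helper used above is not shown in this answer. A minimal brute-force sketch in Python 3 could look like the following; it is an assumption about what such a helper does (return one longest contiguous substring shared by all inputs), not the code from the linked recipe.

```python
def long_substr(strings):
    """Return one longest contiguous substring common to all strings ("" if none)."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest and stop at the first hit.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


print(long_substr(["WMAYUJ844900", "WMAYUJ753738"]))    # WMAYUJ
print(long_substr(["0324311071068", "0324311071134"]))  # 0324311071
```

This is cubic in the worst case, which is fine for short serial numbers. When several substrings tie for the longest length, it returns the leftmost one in the shortest input.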

--- first buggy reply starts here

Below is the first part of my reply, which may only help you understand where I was wrong and maybe give you some ideas:

A sample using the Longest Common Subsequence problem solver LCS for your particular need, which I thought of as a first step in a process of guessing what will be common.

It is in Python, but for the demo part it is easily readable (or can be cut and pasted into IDLE, the Python editor), assuming that you use the ActiveState Code Recipes from the first link above.

This has to do with bioinformatics (think of gene alignment).

You will need something to decide which common sequence is the most interesting (maybe one having a minimal length?) and then proceed with masking as already proposed by David or in my comment.

(At first I did not see that the LCS solver is not a consecutive LCS solver, which is what you need! So my first use of the LCS solver is buggy :( Because it is not contiguous, I get WMAYUJ8 or WMAYUJ7 and not WMAYUJ, which is shorter: the solver finds the longest run of common characters without expecting them to be consecutive. Again, sorry for that.)

>>> raw = """1613518L121
1613509L121
1613519L121

0324311071068
0324311071134

608131234
608131237

WMAYUJ753738
WMAYUJ072099
WMAYUJ683739
WMAYUJ844900"""
>>> d = dict()
>>> for line in raw.split("\n"):
    if not line.strip(): continue
    value = line.strip()
    d[value] = 1

>>> for x in d:
    for y in d:
        if x == y: continue
        # LCS here is a subsequence solver, so matches need not be contiguous
        length = LCSLength(x, y)
        common = LCS(x,y)
        if  length >= 3 and x.startswith(common):
            print("\t".join((x, y, str(length), common)))

which produces:

0324311071068   0324311071134   10  0324311071
0324311071068   608131234   4   0324
0324311071134   0324311071068   10  0324311071
WMAYUJ844900    WMAYUJ753738    7   WMAYUJ8
WMAYUJ753738    WMAYUJ072099    7   WMAYUJ7
608131237   608131234   8   60813123
608131234   608131237   8   60813123
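
The WMAYUJ8 and WMAYUJ7 rows above follow directly from how a longest common subsequence is defined: characters must appear in the same order in both strings, but gaps are allowed. The Python 3 sketch below is a textbook dynamic-programming version written for illustration; it is not the ActiveState recipe, and the name `lcs` is an assumption.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b (gaps allowed)."""
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one subsequence.
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("WMAYUJ844900", "WMAYUJ753738"))  # WMAYUJ8
```

The trailing 8 comes from the first 8 in one serial and the last 8 in the other, so the result is not a substring of the second serial. That is why a contiguous (substring) solver is the better fit for finding shared prefixes.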

score -1 · Answer 3 · answered Jun 03 '12 at 00:40

Run spam-detection-style algorithms (statistical ones such as Bayesian classifiers, or similar "learning" approaches). This may or may not help you, but if it doesn't, I honestly doubt you will ever come up with a useful logic-based algorithm here.

krzych
