
We are setting up Data Loss Prevention for emails. The issue is that when people reply to an email several times, the same credit card number or account number can appear multiple times in the thread.

How can we get a Java regex to match each distinct string only once?

For example, we use the following regex to catch account numbers made up of 2 letters followed by 5 or 6 digits. It also skips anything starting with CR or cr.

\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b

How can we have it find:

CX12345
CX14584
JB145888
JD748452
CX12345 (ignore, as it was already found above)
LM45855
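
To make the problem concrete, here is a minimal Java sketch that runs the pattern above over the sample list. It assumes the pattern is called from Java code (backslashes doubled for the string literal); the class and method names `DuplicateMatchDemo` and `findAll` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateMatchDemo {
    // The pattern from the question, with backslashes doubled for a Java string literal.
    static final Pattern ACCOUNT =
            Pattern.compile("\\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\\b");

    // Returns every match in order of appearance, duplicates included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = ACCOUNT.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855";
        // Prints [CX12345, CX14584, JB145888, JD748452, CX12345, LM45855]
        System.out.println(findAll(body));
    }
}
```

CX12345 is reported twice, which is exactly the behaviour we want to avoid.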
user202729
Chosenv3
  • I'd rather suggest matching the last occurrence: [`(?s)\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?!.*\b\1\b)`](https://regex101.com/r/ink9TO/1). In Java, do not forget to use double backslashes. – Wiktor Stribiżew Nov 07 '16 at 16:32
  • Hello, thanks for the assistance. I have tried adding this string to a regex tester, but it reports that it doesn't like the (?s) at the beginning. When I remove the (?s), it still does what it was doing before: matching multiple copies of the same string. Any ideas, please? The tool I'm using to test is www.regextester.com, using JavaScript. Thanks – Chosenv3 Nov 08 '16 at 14:19
  • Why "regex tester"? Use it in the Java code. I already provided the test in an online tester in my comment. In Java, use `String pattern = "(?s)\\b((?!CR|cr)[A-Za-z]{2}\\d{5,6})\\b(?!.*\\b\\1\\b)";`. [**Here is a Java demo**](http://ideone.com/7IihtV). – Wiktor Stribiżew Nov 08 '16 at 14:20
  • See http://ideone.com/7IihtV – Wiktor Stribiżew Nov 08 '16 at 14:23
  • Hi guys, apologies for the delay. It doesn't appear to have worked. I think this may be my fault though, I mentioned in the original question that it was java script. I don't think that is exactly the case. I am using this in a console called Mimecast which we use as an external mail service. They ask us for a single line of Regex to capture within each email that goes through it. I understand that it can use java based Regex queries and Python. Therefore I am not too sure by what you mean with using the Java code as I am not a java developer. I hope this makes sense. – Chosenv3 Nov 15 '16 at 15:43
  • Link to screenshot here https://postimg.org/image/m8mobbbdj/ – Chosenv3 Nov 15 '16 at 15:46
  • Try `/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/` or `\\b((?!CR|cr)[A-Za-z]{2}\\d{5,6})\\b(?![\\s\\S]*\\b\\1\\b)` – Wiktor Stribiżew Nov 15 '16 at 16:01
  • That is absolutely perfect! Thank you so much for your help! The first one worked like a dream - \b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b) Would you be able to explain to me why this one worked please, so that I can learn for future ones :) – Chosenv3 Nov 16 '16 at 13:43
  • I added the answer with explanations of the first regex. – Wiktor Stribiżew Nov 16 '16 at 13:50

2 Answers


A unique occurrence of a string can be matched with

<STRING_PATTERN>(?!.*<STRING_PATTERN>)  // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
                                        // that supports infinite-width lookbehind patterns

where <STRING_PATTERN> is the pattern whose unique occurrence you are looking for. Both forms work with the .NET regex library, but the second is not supported by most other libraries (only the PyPI regex library for Python and JavaScript engines implementing ECMAScript 2018 support it). Note that . does not match line break characters by default, so you need a DOTALL-style modifier: in most libraries you can add the (?s) modifier inside the pattern (in Ruby, (?m) plays that role instead), or pass the corresponding flag to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
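
As a rough illustration of the last-occurrence form and of why the DOTALL modifier matters, here is a minimal Java sketch. The token pattern `ID\d+`, the sample text, and the names `LastOccurrenceDemo` and `firstHit` are assumptions made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastOccurrenceDemo {
    // Hypothetical token pattern used only for illustration: "ID" followed by digits.
    static final String TOKEN = "ID\\d+";

    // Builds TOKEN(?!.*TOKEN) and returns its first match, or null if there is none.
    static String firstHit(String text, int flags) {
        Matcher m = Pattern.compile(TOKEN + "(?!.*" + TOKEN + ")", flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "ID1 foo\nID22 bar\nID333";
        // With DOTALL, . crosses line breaks, so only the last token in the text matches.
        System.out.println(firstHit(text, Pattern.DOTALL)); // ID333
        // Without DOTALL, the lookahead stops at the end of the line, so ID1 already matches.
        System.out.println(firstHit(text, 0));              // ID1
    }
}
```

Passing `Pattern.DOTALL` to `Pattern.compile` has the same effect as putting `(?s)` at the start of the pattern.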

You seem to need a regex like this:

/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/

The regex demo is available here

Details:

  • \b - a leading word boundary
  • ((?!CR|cr)[A-Za-z]{2}\d{5,6}) - capturing group 1:
    • (?!CR|cr) - a negative lookahead that fails the match if the next two characters are CR or cr
    • [A-Za-z]{2} - 2 ASCII letters
    • \d{5,6} - 5 to 6 digits
  • \b - trailing word boundary
  • (?![\s\S]*\b\1\b) - a negative lookahead that fails the match if, after any 0+ chars ([\s\S]*), there is a word boundary (\b), the same value that was captured into Group 1 (the \1 backreference), and a trailing word boundary. As a result, only the last occurrence of each value is matched.
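
For readers who call the pattern from Java code, a minimal sketch of how it behaves on the sample from the question could look like this. The backslashes are doubled for the string literal, and the names `LastOccurrenceAccounts` and `uniqueAccounts` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastOccurrenceAccounts {
    // Same regex as above, without the /.../ delimiters and with doubled backslashes.
    static final Pattern UNIQUE = Pattern.compile(
            "\\b((?!CR|cr)[A-Za-z]{2}\\d{5,6})\\b(?![\\s\\S]*\\b\\1\\b)");

    // Collects Group 1 of every match; each value is matched at its last occurrence only.
    static List<String> uniqueAccounts(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = UNIQUE.matcher(text);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855\nCR12345";
        // Prints [CX14584, JB145888, JD748452, CX12345, LM45855]
        System.out.println(uniqueAccounts(body));
    }
}
```

The first CX12345 is skipped because the same value appears again later, so the value is reported once, at its final position. CR12345 is rejected by the (?!CR|cr) lookahead.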
Wiktor Stribiżew

I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:

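import java.util.HashMap;
import java.util.Map;

String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();

// matches() tests the whole string; a key that is put twice is stored only once
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    ccMap.put(ccNumber, Boolean.TRUE);
}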

Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:

for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
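
If the numbers have to be pulled out of a whole message body rather than checked one string at a time, the same idea can be sketched as below. This is a variation, not the code above: it swaps String.matches() for Matcher.find() and the Map for a LinkedHashSet, and it assumes the text is available to Java code. The names `DedupWithSet` and `uniqueAccounts` are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DedupWithSet {
    // The question's pattern with word boundaries, so it can match inside longer text.
    static final Pattern ACCOUNT =
            Pattern.compile("\\b(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}\\b");

    // Scans the text and keeps each matched value once, in order of first appearance.
    static Set<String> uniqueAccounts(String text) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = ACCOUNT.matcher(text);
        while (m.find()) {
            seen.add(m.group()); // add() ignores values that are already present
        }
        return seen;
    }

    public static void main(String[] args) {
        String body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855";
        // Prints [CX12345, CX14584, JB145888, JD748452, LM45855]
        System.out.println(uniqueAccounts(body));
    }
}
```

Here the deduplication happens in code rather than in the regex, so the pattern itself stays simple.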
Tim Biegeleisen
  • Hello, thank you for the response. However, I'm not sure if this will work in our email client. I believe it can only look for one basic string. However, I will give this a try. – Chosenv3 Nov 07 '16 at 16:16
  • `if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$"))` = `if (ccNumber.matches("(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}"))` = `if (ccNumber.matches("^(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}"))` and will only match a whole string that meets the pattern. – Wiktor Stribiżew Nov 07 '16 at 16:21
  • @WiktorStribiżew I'm assuming that the OP will be parsing a set of strings, wanting to know if each one might be a match. – Tim Biegeleisen Nov 07 '16 at 16:27