4

I want to know how to achieve the following with Java regex:

  • find all consecutive digits in a string that are not separated by alphabetic characters, count them, and if the count is 4 or 5, replace each digit with "*"

Examples:

  • "0000" will become "****"
  • "any text 000 00 more texts" will become "any text ***** more texts" (notice that the space is removed)
  • "any text 000 00 more texts 00" will become "any text ***** more texts 00"

  • "any text 000 00 more texts 00 00" will become "any text ***** more texts ****"

  • "any text 00-00 more texts 00_00" will become "any text **** more texts ****"

To find the numbers I have tried:

  • (\d*)(?=[^a-bA-Z]*) and

  • (\d*)([^a-bA-Z])(\d*)

But none of these even matches the cases above.

I would like to understand better how to approach regex operations like this.

Cœur
mnish
  • What if we have `foo 123 56 78 bar`? Should it not be replaced with `*` or maybe we should expect something like `foo ***** 78 bar`? – Pshemo Mar 11 '16 at 16:43
  • It should not be replaced with '*' – mnish Mar 11 '16 at 16:45
  • @Pshemo you have 7 consequent numbers not separated by alphabetic characters, so the condition should not match I guess – Bax Mar 11 '16 at 16:45
  • This is the [working demo](https://ideone.com/EN927z). And a [regex demo](https://regex101.com/r/qS3jQ6/1). – Wiktor Stribiżew Mar 11 '16 at 16:58
  • @WiktorStribiżew Would you please suggest me some regex related tutorials or books you followed? or which one be the best way to learn regex? – SkyWalker Mar 11 '16 at 17:06
  • Actually, I am not sure my suggestion should be posted. Please test and if it works OK, does not freeze anything and no stack overflow errors appear, I will post. – Wiktor Stribiżew Mar 11 '16 at 17:07
  • And I do not know your level of regex knowledge :) so that I can only suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). – Wiktor Stribiżew Mar 11 '16 at 17:08
  • @WiktorStribiżew You are great. Thanks for helping me. – SkyWalker Mar 11 '16 at 17:15
  • @WiktorStribiżew From what I see it doesn't handle correctly example from my comment: https://ideone.com/IJcUix (unless I misunderstood OP response). – Pshemo Mar 11 '16 at 17:22
  • @pshemo That is another reason why I do not post it - the lookahead boundaries should be adjusted. – Wiktor Stribiżew Mar 11 '16 at 17:24
  • My approach won't work well in Java (yes, we can use a constrained width lookbehind, but it is rather clumsy), good there are lots of answers here. – Wiktor Stribiżew Mar 14 '16 at 08:05

4 Answers

1

Here's a method you can try:

s = s.replaceAll("\\d{5}", "*****").replaceAll("\\d{4}", "****");
for (int i = 1; i < 5; i++) {
    s = s.replaceAll("(\\d{" + i + "})([^A-Za-z]*)(\\d{" + (5 - i) + "})", "*****");
}
for (int i = 1; i < 4; i++) {
    s = s.replaceAll("(\\d{" + i + "})([^A-Za-z]*)(\\d{" + (4 - i) + "})", "****");
}
Leo Aso
1

You could use something like:

private static final Pattern p = Pattern
        .compile( "(?<!\\d[^a-z\\d]{0,10000})"
                + "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
                + "(?![^a-z\\d]*\\d)", Pattern.CASE_INSENSITIVE);

public static String replaceSpecial(String text) {
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher(text);
    while (m.find()) {
        m.appendReplacement(sb, m.group(2) == null ? "****" : "*****");
    }
    m.appendTail(sb);
    return sb.toString();
}

Usage demo:

System.out.println(replaceSpecial("foo 123 56 78 bar 12 32 abc 000_00"));
System.out.println(replaceSpecial("0000"));
System.out.println(replaceSpecial("any text 00 00 more texts"));
System.out.println(replaceSpecial("any text 000 00 more texts 00"));
System.out.println(replaceSpecial("any text 000 00 more texts 00 00"));
System.out.println(replaceSpecial("any text 00-00 more texts 00_00"));

Result:

foo 123 56 78 bar **** abc *****
****
any text **** more texts
any text ***** more texts 00
any text ***** more texts ****
any text **** more texts ****

Idea/explanation:

We want to find a series of digits that may have zero or more characters between them which are neither digits nor letters (we can represent these separators as [^\\da-z], but IMO [^a-z\\d] looks better, so I will use that form). The series must contain 4 or 5 digits, which we can write as

digit([validSeparator]*digit){3,4} //1 digit + (3 OR 4 digits) => 4 OR 5 digits

but we also need some way to tell whether we matched 4 or 5 digits, because that decides whether the match is replaced with 4 or 5 asterisks.
For this purpose I will put the 5th digit in a separate group and test whether that group took part in the match. So I will try to create something like dddd(d)?.

And that is how I came up with

  "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
//                      ^^^^^^^^^^^^^^^ possible 5th digit
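
As a minimal sketch of that idea (the class and method names here are hypothetical, and the look-arounds are left out on purpose), the optional group can be inspected on its own. It assumes a-zA-Z is written out explicitly instead of relying on the case-insensitive flag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FifthDigitGroupDemo {
    // Core pattern only: four digits, then an optional fifth digit captured in group 2.
    static final Pattern CORE = Pattern.compile(
            "\\d([^a-zA-Z\\d]*\\d){3}([^a-zA-Z\\d]*\\d)?");

    // Returns 4 or 5 for the first match, depending on whether group 2 took part; 0 if nothing matched.
    static int digitsMatched(String text) {
        Matcher m = CORE.matcher(text);
        if (!m.find()) {
            return 0;
        }
        return m.group(2) == null ? 4 : 5;
    }

    public static void main(String[] args) {
        System.out.println(digitsMatched("00-00"));  // 4
        System.out.println(digitsMatched("000 00")); // 5
        System.out.println(digitsMatched("12 3"));   // 0
    }
}
```

Without look-arounds this core pattern would also report 5 for "123456", which is the problem addressed next.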

Now we need to make sure that our regex matches only dddd(d)? sequences that have no further digit on the left or right (directly or across separators), because we don't want to match part of cases like

d ddddd
 dddddd
 ddddd d

So we need to add checks that there is no digit (with optional valid separators in between) before or after our match. We can use negative look-around for this:

  • "(?<!\\d[^a-z\\d]{0,10000})" - I used {0,10000} instead of * because look-behind in Java needs a bounded maximum length, which rules out *.

  • "(?![^a-z\\d]*\\d)"
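
The effect of the two look-arounds can be seen by running the same core pattern with and without them. This is only an illustrative sketch: the class, the SEP/CORE constants and the findAll helper are hypothetical names, and letters are excluded with an explicit a-zA-Z rather than the case-insensitive flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAroundDemo {
    static final String SEP = "[^a-zA-Z\\d]";  // a valid separator: neither letter nor digit
    static final String CORE = "\\d(" + SEP + "*\\d){3}(" + SEP + "*\\d)?";

    static final Pattern UNGUARDED = Pattern.compile(CORE);
    // Same core, but rejected when a digit (plus optional separators) sits before or after it.
    static final Pattern GUARDED = Pattern.compile(
            "(?<!\\d" + SEP + "{0,10000})" + CORE + "(?!" + SEP + "*\\d)");

    // Collects every match of p in text, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "foo 123 56 78 bar 12 32 abc";
        System.out.println(findAll(UNGUARDED, s)); // [123 56, 12 32]
        System.out.println(findAll(GUARDED, s));   // [12 32]
    }
}
```

The unguarded pattern wrongly picks five digits out of the seven-digit run, while the guarded one skips that run entirely.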

So now all we need to do is combine these regexes (and make the pattern case-insensitive, or use a-zA-Z instead of a-z)

Pattern p = Pattern.compile( "(?<!\\d[^a-z\\d]{0,10000})"
                           + "\\d([^a-z\\d]*\\d){3}([^a-z\\d]*\\d)?"
                           + "(?![^a-z\\d]*\\d)", Pattern.CASE_INSENSITIVE);

The rest is a simple use of the appendReplacement and appendTail methods of the Matcher class, which let us decide dynamically what to use as the replacement for each match found (I tried to explain it better here: https://stackoverflow.com/a/25081783/1393766)
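
On Java 9 or later, the same per-match decision can probably be written more compactly with Matcher.replaceAll(Function<MatchResult, String>). The following is a sketch under that assumption, not a replacement for the code above; the class name is hypothetical and a-zA-Z is spelled out instead of using CASE_INSENSITIVE.

```java
import java.util.regex.Pattern;

class ReplaceWithFunctionDemo {
    static final Pattern P = Pattern.compile(
            "(?<!\\d[^a-zA-Z\\d]{0,10000})"
          + "\\d([^a-zA-Z\\d]*\\d){3}([^a-zA-Z\\d]*\\d)?"
          + "(?![^a-zA-Z\\d]*\\d)");

    static String replaceSpecial(String text) {
        // group(2) is null when the optional fifth digit did not take part in the match.
        return P.matcher(text).replaceAll(r -> r.group(2) == null ? "****" : "*****");
    }

    public static void main(String[] args) {
        System.out.println(replaceSpecial("0000"));
        System.out.println(replaceSpecial("any text 000 00 more texts 00 00"));
        System.out.println(replaceSpecial("foo 123 56 78 bar"));
    }
}
```

The replacement strings contain only asterisks, so no escaping of `$` or `\` is needed here; for arbitrary replacement text, Matcher.quoteReplacement would be the safer choice.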

Community
Pshemo
1

Try with:

(?<!\d|\d[_\W])(?=(\d|(?<=\d)[_\W]\d){4,5}(?!\d|[_\W]\d))\d|(?<=(?!^)\G)[_\W]?\d

DEMO

The (?<!\d|\d[_\W])(?=(\d|(?<=\d)[_\W]\d){4,5}(?!\d|[_\W]\d))\d part matches a digit if:

  • it is not preceded by a digit, or by a digit followed by a non-alphanumeric character (including _),
  • it starts a sequence of 4-5 digits, where each later digit may be preceded by a single non-alphanumeric character,
  • that sequence is not followed by another digit, or by a non-alphanumeric character and a digit.

The (?<=(?!^)\G)[_\W]?\d part matches if:

  • it directly follows the previous match, but not at the beginning of the input,
  • it is a digit, optionally preceded by one non-alphanumeric character.
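
To make the chaining through \G visible, the individual matches can be listed instead of replaced. This is a small sketch with hypothetical class and method names; it uses the pattern exactly as given above, split across two string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedMatchDemo {
    static final Pattern P = Pattern.compile(
            "(?<!\\d|\\d[_\\W])(?=(\\d|(?<=\\d)[_\\W]\\d){4,5}(?!\\d|[_\\W]\\d))\\d"
          + "|(?<=(?!^)\\G)[_\\W]?\\d");

    // Lists each separate match; every entry is later replaced by a single "*".
    static List<String> pieces(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The first branch matches the leading digit; the \G branch then matches
        // one digit (with an optional separator in front) per find() call.
        System.out.println(pieces("000 00"));
        System.out.println(pieces("00"));
    }
}
```

Because each match covers one digit together with any separator in front of it, replacing every match with "*" removes the separators and yields one asterisk per digit.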

Example in Java:

public class RegexExample {
    public static void  main(String[] args) {
        String[] examples = {"0000","any text 000 00 more texts","any text 000 00 more texts 00",
                "any text 000 00 more texts 00 00","any text 00-00 more texts 00_00","test 00 00 00 00 00 test"};

        for(String example : examples) {
            System.out.println(example.replaceAll("(?<!\\d|\\d[_\\W])(?=(\\d|(?<=\\d)[_\\W]\\d){4,5}(?!\\d|[_\\W]\\d))\\d|(?<=(?!^)\\G)[_\\W]?\\d","*"));
        }
    }
}

with output:

****
any text ***** more texts
any text ***** more texts 00
any text ***** more texts ****
any text **** more texts ****
test 00 00 00 00 00 test
m.cekiera
0

Try this

(?:(?:\d[- _]*){6,})|(?<num_1>\d[- _]*)(?<num_2>\d[- _]*)(?<num_3>\d[- _]*)(?<num_4>\d)(?<num_5>[- _]*\d)?

Demo

Explanation:
(?:(?:\d[- _]*){6,}): matches a run of 6 or more digits without capturing anything
(?<num_1>\d[- _]*)(?<num_2>\d[- _]*)(?<num_3>\d[- _]*)(?<num_4>\d)(?<num_5>[- _]*\d)?: captures a run of 4-5 digits.

Input

"1-2-3-4-5-6" => no
"1-2-3-4-5"   => yes match 1
"1-2-3-4"     => yes match 2
"1-2-3"       => no
"123456"      => no
"12345"       => yes match 3
"1234"        => yes match 4
"123"         => no
foo 123 56 78 bar

Output:

MATCH 1
num_1   [21-23] `1-`
num_2   [23-25] `2-`
num_3   [25-27] `3-`
num_4   [27-28] `4`
num_5   [28-30] `-5`
MATCH 2
num_1   [50-52] `1-`
num_2   [52-54] `2-`
num_3   [54-56] `3-`
num_4   [56-57] `4`
MATCH 3
num_1   [119-120]   `1`
num_2   [120-121]   `2`
num_3   [121-122]   `3`
num_4   [122-123]   `4`
num_5   [123-124]   `5`
MATCH 4
num_1   [148-149]   `1`
num_2   [149-150]   `2`
num_3   [150-151]   `3`
num_4   [151-152]   `4`

Then replace them with * : )
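
A possible way to do that replacement step in Java is sketched below. Two things are assumptions rather than part of the answer above: the group names are changed to num1..num5 because Java group names may contain only letters and digits (an underscore raises a PatternSyntaxException), and the class and method names are hypothetical. Matches from the 6+ branch are written back unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupMaskDemo {
    // First branch swallows runs of 6+ digits; second branch captures runs of 4-5 digits.
    static final Pattern P = Pattern.compile(
            "(?:\\d[- _]*){6,}"
          + "|(?<num1>\\d[- _]*)(?<num2>\\d[- _]*)(?<num3>\\d[- _]*)(?<num4>\\d)(?<num5>[- _]*\\d)?");

    static String mask(String text) {
        StringBuffer sb = new StringBuffer();
        Matcher m = P.matcher(text);
        while (m.find()) {
            String replacement;
            if (m.group("num1") == null) {
                replacement = m.group();  // 6+ digits: keep the text as it was
            } else {
                replacement = m.group("num5") == null ? "****" : "*****";
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("1-2-3-4-5-6")); // 1-2-3-4-5-6
        System.out.println(mask("1-2-3-4-5"));   // *****
        System.out.println(mask("1234"));        // ****
        System.out.println(mask("123"));         // 123
    }
}
```

Note that this pattern only treats space, hyphen and underscore as separators, so it is narrower than the "anything that is not a letter" rule in the question.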

Tim007