I am trying to throw everything out of a string except letters, spaces and decimals of the type [0-9]{1,3} before the dot and [0-9]{1,2} after the dot.

I've come up with this in Java:

replaceAll("[^\\p{L}\\s(\\s[0-9]{1,3}(\\\\.[0-9]{1,2})?)]", "+"));

I really can't get it to work. I'm a real newbie when it comes to regex.

Examples

This : mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla 1000 mpla 1000.12 mpla12.5

Returns : mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla + mpla + +

//Special caution on mpla12.5 this too is not wanted because I want a format of \sNUMBER\s

Alkis Kalogeris

2 Answers

Just a note: regexes are not really good at "not" semantics outside of character classes. So I would suggest concentrating on what you do want to keep and building your result from that:

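import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla 1000 mpla 1000.12 mpla12.5";
// Match a run of letters, or a whitespace character optionally followed by
// a number (1-3 digits, optional 1-2 decimals) and another whitespace character.
Pattern p = Pattern.compile("[A-Za-z]+|\\s(\\d{1,3}(\\.\\d{1,2})?\\s)?");
Matcher m = p.matcher(s);
StringBuilder sb = new StringBuilder();
// Append every match; anything the pattern skips over is dropped.
while (m.find()) {
    sb.append(m.group());
}
System.out.println(sb.toString());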

Outputs:

mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla  mpla  mpla

I think that this is what you are asking for in the strictest sense -- note that there are multiple spaces in the result that you will have to sanitize if desired.
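If the repeated spaces are unwanted, one way to sanitize them is to collapse each run of whitespace into a single space. This is only a minimal sketch; the class and method names (`WhitespaceCleanup`, `collapse`) are made up for illustration.

```java
class WhitespaceCleanup {
    // Collapse each run of whitespace to a single space and
    // drop any leading or trailing whitespace.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "mpla 1.52 mpla mpla mpla"
        System.out.println(collapse("mpla 1.52 mpla  mpla  mpla"));
    }
}
```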

Edit: Let me clarify what I mean by regexes are not really good for doing "not" semantics outside of character classes. If you just wanted to "match any character that isn't a letter or whitespace" that would be easy with a negated character class: [^A-Za-z\\s]. However, once you start needing negations of multi-character groupings (\\d{1,3}\\.\\d{1,2} for example) it gets ugly. You can technically do it using negative lookaheads, but it's kludgy and not very intuitive. This post explains it well: https://stackoverflow.com/a/406408/1311394
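For illustration, here is roughly what the lookahead route could look like. It is a sketch under assumptions, not a recommendation: it treats the input as whitespace-delimited tokens, uses `[A-Za-z]` rather than `\p{L}` for letters, and replaces every token that is neither a word nor an allowed number with `+`. The names `InvalidTokenReplacer` and `replaceInvalid` are hypothetical.

```java
import java.util.regex.Pattern;

class InvalidTokenReplacer {
    // (?<!\S)  start only at the beginning of a whitespace-delimited token
    // (?!...)  and only if the token is NOT a whole word or an allowed number
    //          (the inner (?!\S) forces the word/number to span the full token)
    // \S+      then consume the whole offending token
    private static final Pattern INVALID_TOKEN = Pattern.compile(
            "(?<!\\S)(?!(?:[A-Za-z]+|\\d{1,3}(?:\\.\\d{1,2})?)(?!\\S))\\S+");

    static String replaceInvalid(String text) {
        return INVALID_TOKEN.matcher(text).replaceAll("+");
    }

    public static void main(String[] args) {
        String s = "mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla 1000 mpla 1000.12 mpla12.5";
        // prints "mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla + mpla + +"
        System.out.println(replaceInvalid(s));
    }
}
```

It works on the sample input, but the nested lookarounds show why this style is hard to read and maintain.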

Edit 2: Based on your comments, I believe that a solution using String.split() along with regex matching will do what you want much more easily:

String s = "12.5 mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla 1000 mpla 1000.12 mpla12.5";
StringBuilder sb = new StringBuilder();
for (String token : s.split("\\s+")) {
    if (token.matches("[A-Za-z]+|\\d{1,3}(\\.\\d{1,2})?")) {
        sb.append(token).append(" ");
    }
}
System.out.println(sb.toString());

Output:

12.5 mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla mpla

This should take care of the cases mentioned in the comments. Most of the time a very complex regex is a code smell, and there's usually a simpler way to solve the problem.
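For large inputs, note that `String.matches` compiles its regex on every call. A possible variation, sketched here with hypothetical names (`TokenFilter`, `keepValidTokens`), compiles both patterns once and joins the surviving tokens with single spaces, which also avoids the trailing space the loop above appends.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TokenFilter {
    // Compiled once and reused for every token.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern VALID_TOKEN =
            Pattern.compile("[A-Za-z]+|\\d{1,3}(\\.\\d{1,2})?");

    // Keep only tokens that are entirely letters or an allowed number.
    static String keepValidTokens(String text) {
        return WHITESPACE.splitAsStream(text.trim())
                .filter(token -> VALID_TOKEN.matcher(token).matches())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "12.5 mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla 1000 mpla 1000.12 mpla12.5";
        // prints "12.5 mpla 12.5 mpla 121.22 mpla 1.52 mpla 1 mpla mpla"
        System.out.println(keepValidTokens(s));
    }
}
```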

ach
  • It's almost there. I want it to be an integer OR a decimal. I think this should do it: Pattern.compile("[A-Za-z]+|\\s(\\d{1,3}(\\.\\d{1,2})?\\s)?"); . Please edit your post so I can accept it. And if you can take the time, please elaborate on the "regexes are not really good for doing "not" semantics outside of character classes" topic. As I mentioned, I'm a newbie. Thank you for your help – Alkis Kalogeris Mar 22 '13 at 14:45
  • Note that it skips the first group if it's a number, e.g. `"12.5 mpla 12.5..."`. I believe it would also extract `mpla` from `mpla12.5`, not ignore it. – Bernhard Barker Mar 22 '13 at 15:21
  • Kind of ugly, but you could add another alternation with the `\\s` replaced by the start anchor `^`: `[A-Za-z]+|\\s(\\d{1,3}(\\.\\d{1,2})?\\s)?|^(\\d{1,3}(\\.\\d{1,2})?\\s)?`. I'm not aware of a way to conditionally match an anchor so that it can be rolled into the second alternation (but would love to know how, if there's a way). As for the second case (keeping the `mpla` from `mpla12.5`), that seems to be the correct behavior based on a strict interpretation of the question... – ach Mar 22 '13 at 15:38
  • No, it's not. I mentioned very clearly that I don't want that. I didn't have time to check it out and I thought it was OK, but it's not. – Alkis Kalogeris Mar 22 '13 at 17:50
  • I interpreted your comment "Special caution on mpla12.5 this too is not wanted because I want a format of \sNUMBER\s" as meaning that you didn't want to keep the whole string "mpla12.5" because the number should be surrounded by spaces, which would leave just "mpla". It seems that what you really want is to tokenize the string with a space delimiter and only keep the parts that are either a sequence of letters or a specific number format. Is that accurate? – ach Mar 22 '13 at 18:36
  • @alkis Please see my edit, does that reflect what you're looking for? – ach Mar 22 '13 at 19:02
  • Yes. This is spot on. Although it will be overkill (too much data, with very few of these numbers), I can't think of anything better or cleaner. Thank you for your time, you really helped. – Alkis Kalogeris Mar 22 '13 at 19:07
  • No problem. In the future, it would be helpful if you posted what the correct output should look like for your example input. – ach Mar 22 '13 at 19:13

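Try this out:

    String data = "ds#@234f&^%%sd232.ertre3df6g#@$566";
    // Drop everything except word characters, whitespace and dots.
    String replaceString = data.replaceAll("[^\\w\\s\\.]", "");

    // Print the original input for comparison.
    System.out.println(data);

    // Split once at the dots and reuse the parts.
    String[] parts = replaceString.split("\\.");
    String firstPart = parts[0];
    String secondString = "." + parts[1];

    // After the first dot, keep only digits and dots.
    String finalString = firstPart + secondString.replaceAll("[^\\d\\.]", "");
    System.out.println(finalString);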
Ankur Shanbhag