4

After OCR recognition I have a lot of words where, instead of o, I have 0. So I want to replace any zeros inside words.

So far I have only managed the following:

String result ="I don't like th0se books";
result = result.replaceAll("\\w+0\\w*", "o");
System.out.println("RESULT:" + result);

My code returns `RESULT:I don't like o books` but I need `RESULT:I don't like those books`. Could anyone tell me how to do it?

Pavel_K
  • Use [lookahead and lookbehind](http://www.regular-expressions.info/lookaround.html) – justhalf Jun 19 '17 at 09:48
  • Why all correct answers are downvoted – Lone_Coder Jun 19 '17 at 09:55
  • Well, if you want to specifically match a `0` glued to a *letter*, better use `.replaceAll("(?<=\\p{L})0|0(?=\\p{L})", "o")`. Or, to only replace `0` in between letters - `.replaceAll("(?<=\\p{L})0(?=\\p{L})", "o")`. – Wiktor Stribiżew Jun 19 '17 at 09:59
  • 2
    Or [`.replaceAll("(?:(?<=\\p{L})|\\G(?!\\A))0", "o")`](https://regex101.com/r/hPbKms/2) (the initial word position is not covered in this case). – Wiktor Stribiżew Jun 19 '17 at 10:05
  • 1
    @WiktorStribiżew why don't you provide an answer instead of a comment, since you are good at regex-related questions? – soorapadman Jun 19 '17 at 10:07
  • @soorapadman: There are too many people in a downvoting mood here. And the question is not precise, there can be many edge cases not accounted for. – Wiktor Stribiżew Jun 19 '17 at 10:08
  • yes, why instead of giving a comment provide a valid solution that kill all other invalid answers – ΦXocę 웃 Пepeúpa ツ Jun 19 '17 at 10:09
  • 1
    @Wiktor Stribiżew Your solution is the only that works right with `I like th0se b00ks with more that 100 pages`. Make the answer and I will accept it. – Pavel_K Jun 19 '17 at 10:09
  • 1
    "And the question is not precise" This is what "Too broad" close votes and downvotes are for. – Unihedron Jun 19 '17 at 10:10
  • 1
    Most of your solutions are acceptable. I have seen many of your answers – soorapadman Jun 19 '17 at 10:10
  • @Pavel_K; Do I understand it right that you do not need to replace `0` at the start of a word? Only when it is immediately preceded with a letter? – Wiktor Stribiżew Jun 19 '17 at 10:11
  • op wants to replace zeros only if those are wrapped by any kind of word.... – ΦXocę 웃 Пepeúpa ツ Jun 19 '17 at 10:12
  • @Wiktor Stribiżew Yes, you are right. – Pavel_K Jun 19 '17 at 10:12
  • so this should be the result: ***"...th0se b00ks with 100 pages"*** → ***"...those books with 100 pages"*** – ΦXocę 웃 Пepeúpa ツ Jun 19 '17 at 10:13
  • If Casimir's answer works, please accept his solution. It is based on the same principle as [mine](https://regex101.com/r/vBOqke/1). – Wiktor Stribiżew Jun 19 '17 at 10:19
  • Sorry, I really feel bad at posting an answer when so many answers are already given. Casimir's answer must work for you. My previous regex description: - `(?:(?<=\p{L})|\G(?!\A))` - a location in a string that is either immediately preceded with a Unicode letter (`(?<=\p{L})`) or is at the end of the previous successful match (`\G(?!\A)`, `\G` also matches the start of a string, thus, the negative lookahead is required here to subtract that possibility) - `0` - a `0` character. – Wiktor Stribiżew Jun 19 '17 at 10:30
  • @WiktorStribiżew: the pattern is good and short, but you need to add an alternative for words that start with `0`, like `0yster`, something like `(?=.\pL)` – Casimir et Hippolyte Jun 19 '17 at 10:46
  • @CasimiretHippolyte: I do not know if I should. OP says they [do not want to handle `0` at the start of a word](https://stackoverflow.com/questions/44627208/replace-zeros-to-letter-o-inside-words?noredirect=1#comment76241052_44627208). Or maybe I do not understand the question. I guess it is the latter. Anyway, the initial `0` may start a sentence, and may need to be replaced with `O`, not `o`. Casimir, I leave this question to you. – Wiktor Stribiżew Jun 19 '17 at 10:47

6 Answers

5

Use a non-word boundary:

result = result.replaceAll("\\B0|0\\B", "o");

That ensures there is at least one word character before or after the 0.
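A minimal sketch of how this pattern behaves, including on a plain number. The class and method names (`BoundaryDemo`, `fix`) are illustrative only and are not part of the answer:

```java
class BoundaryDemo {
    // Replaces every 0 that has a word character directly before or after it.
    static String fix(String s) {
        return s.replaceAll("\\B0|0\\B", "o");
    }

    public static void main(String[] args) {
        System.out.println(fix("I don't like th0se b00ks")); // I don't like those books
        // Digits are word characters too, so zeros inside a number are also hit.
        System.out.println(fix("100 pages"));                // 1oo pages
        // An isolated 0 has a word boundary on both sides and is left alone.
        System.out.println(fix("a 0 b"));                    // a 0 b
    }
}
```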

If you want to prevent zeros inside a number from being replaced:

result = result.replaceAll("\\b(?!\\d+\\b)(?:0\\B|([^\\W0]+)0)|\\G(?!\\A)0", "$1o");

details:

\\b              # a word boundary
(?!\\d+\\b)      # negative lookahead: not followed by an integer
(?:
    0\\B         # zero and a non-word boundary (means a word character follows)
  |
    ([^\\W0]+)0  # word characters without zero and a zero
)
|
\\G(?!\\A)0  # a zero contiguous to a previous match (not at the start of the string)

(Obviously, a regex pattern can't tell the difference between an isolated "0" and an isolated "o", or between a "0" and an "o" in a reference number or in a number in scientific notation.)
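As a quick check, the number-preserving pattern can be wrapped in a helper and run against the sentence from the comments. This is only a sketch; the names `NumberSafeDemo` and `fix` are made up for the illustration:

```java
class NumberSafeDemo {
    // Same pattern as in the answer: skip tokens that are whole integers,
    // then replace a leading 0, a 0 after non-zero word characters,
    // or a 0 that directly follows the previous match (\G).
    static String fix(String s) {
        return s.replaceAll(
                "\\b(?!\\d+\\b)(?:0\\B|([^\\W0]+)0)|\\G(?!\\A)0",
                "$1o");
    }

    public static void main(String[] args) {
        System.out.println(fix("I like th0se b00ks with more that 100 pages"));
        // I like those books with more that 100 pages
        System.out.println(fix("0yster")); // oyster
        System.out.println(fix("2000"));   // 2000
    }
}
```

When group 1 does not take part in a match (the leading-zero and `\G` branches), Java appends nothing for `$1`, so the replacement is just `o`.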


Another way: capture everything that should be left untouched before the zero:

result = result.replaceAll("((?>(?:[\\W_]+|\\pL+|\\b\\d+\\b)*))(?:\\B0|0\\B)", "$1o");
Casimir et Hippolyte
2

The regex should be "0" not "\\w+0\\w*".

Also, to keep the rest of the words, use capturing groups: result = result.replaceAll("(\\w+)0(\\w*)", "$1o$2");

To replace only between letters and ignore numbers, as the requirement asks: result = result.replaceAll("([a-zA-Z]+)0([a-zA-Z\\s0]+)", "$1o$2");
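A small sketch of the capturing-group idea, using the `(\\w+)0(\\w*)` pattern from above. The class and method names are hypothetical. Note that each match consumes the whole word, so a single pass replaces only one zero per word:

```java
class CaptureGroupDemo {
    // $1 is the part of the word before the zero, $2 the part after it.
    static String fix(String s) {
        return s.replaceAll("(\\w+)0(\\w*)", "$1o$2");
    }

    public static void main(String[] args) {
        System.out.println(fix("I don't like th0se books")); // I don't like those books
        // Greedy \w+ takes "b0", so only the last zero of the word is replaced.
        System.out.println(fix("b00ks"));                    // b0oks
    }
}
```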

Unihedron
1
(\B0\B|\B0|0\B)

Matches three cases:

  • 0 in the middle of a word, e.g. "th0se"
  • 0 at the end of a word, e.g. "lid0"
  • 0 at the start of a word, e.g. "0thers"

So: `result = result.replaceAll("(\\B0\\B|\\B0|0\\B)", "o");`

However, this will also turn "I have 101 dogs" into "I have 1o1 dogs", so you will probably want to further refine your expression or logic.

While a single regex can be written to achieve this, I feel that it would be simpler and clearer to achieve it in ordinary Java code:

  • split the line into tokens (a token can be a chunk of whitespace or a chunk of non-whitespace; you could capture these using the regex (\s+|\S+) and a Matcher)
  • for each token:
    • if it's whitespace, leave it alone
    • if it consists entirely of numbers and symbols, leave it alone
    • else word.replace('0','o')
    • output token
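The steps above could be sketched roughly as follows. The names `TokenDemo` and `fix` are illustrative, and "consists entirely of numbers and symbols" is interpreted here as "contains no letter", which is an assumption rather than something the answer specifies:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // A token is either a run of whitespace or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    static String fix(String line) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String token = m.group();
            // Assumption: a token with no letter at all (whitespace, numbers,
            // symbols) is copied unchanged; any other token has its zeros replaced.
            boolean hasLetter = token.codePoints().anyMatch(Character::isLetter);
            out.append(hasLetter ? token.replace('0', 'o') : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("I like th0se b00ks with more that 100 pages"));
        // I like those books with more that 100 pages
        System.out.println(fix("I have 101 dogs")); // I have 101 dogs
    }
}
```

A mixed token such as `A10` would still have its zero replaced under this rule, so the "leave alone" test may need tightening for real OCR output.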
slim
0

If you don't want to use a complex regex, you can iterate over the string and do the same.

char c[] = new char[s.length()];
for(int i=0;i<s.length();i++){
    if(s.charAt(i) == '0'){
        c[i] = 'o';
    }else{
        c[i] = s.charAt(i);
    }
}
 //now convert to string.
s = String.valueOf(c);

To replace zeros only inside words, you can check the neighbouring characters as follows:

    String s = "I like th0se b00ks ... 100 pages";
    // Start from a copy so that every character, including the first and last, is preserved.
    char c[] = s.toCharArray();
    for(int i=1;i<s.length()-1;i++){
        // Replace a zero only when neither neighbour is a digit.
        // Note: adjacent zeros (as in "b00ks") count as digits for each other and are left unchanged.
        if(s.charAt(i) == '0' && !Character.isDigit(s.charAt(i+1)) && !Character.isDigit(s.charAt(i-1))){
            c[i] = 'o';
        }
    }

    //check corner conditions (both need at least two characters).
    if(s.length() >= 2 && !Character.isDigit(s.charAt(1)) && s.charAt(0) == '0'){
        c[0] = 'o';
    }

    if(s.length() >= 2 &&!Character.isDigit(s.charAt(s.length()-2)) && s.charAt(s.length()-1) == '0'){
        c[s.length()-1] = 'o';
    }

    //now convert to string.
    s = String.valueOf(c);
    System.out.println(s);
Kaushal28
  • Downvote to the answer without even mentioning the reason :( – Kaushal28 Jun 19 '17 at 09:51
  • 1
    This is an unnecessarily complex way to do a simple replace, and completely misses the OP's requirement that they only want to replace `0` when it occurs _inside a word_. – khelwood Jun 19 '17 at 09:57
  • @khelwood It might be complex way for experts. But I think it is easiest way to do this. And it is doing exactly what is requested by the OP. – Kaushal28 Jun 19 '17 at 10:00
  • And yes, I agree that it uses an extra array. But time complexity wise it is same as the regex: https://stackoverflow.com/questions/4378455/what-is-the-complexity-of-regular-expression – Kaushal28 Jun 19 '17 at 10:01
  • 1
    **1)** You skip the first and last character **2)** You don't check for every numerical character (this is not restricted to '0' to '9') **3)** Is `#0` is seen as a word or not ? – AxelH Jun 19 '17 at 10:15
  • Okay updating the answer @AxelH – Kaushal28 Jun 19 '17 at 10:16
-1

Try: result = result.replaceAll("(\\w+)0(\\w+)", "$1o$2");

Using the input: "I don't like th0se books 00 1230"

You get: "I don't like those books 00 1230"

EDIT:

If you use: result = result.replaceAll("([a-zA-Z]+)0([a-zA-Z]+)", "$1o$2");, it should work for the "I don't like th0se books 00 1230 1230456" string too.

Marcell
  • 2
    duplicate answer – JacksOnF1re Jun 19 '17 at 09:55
  • I like that you changed * into + but on the other hand I'm starting to wonder what the original problem of the code was. What should be the desired behaviour when the OCR matches `P00H`, for example? I can only pause and ponder and wonder what the vague problem statement represented. – Unihedron Jun 19 '17 at 09:56
  • Destructs numbers having zero in them like `1230456`. – Markus Benko Jun 19 '17 at 09:57
  • True. I did not think about that. But maybe - as you pointed out - the correct answer depends on the needs of the usage of this algorithm. – Marcell Jun 19 '17 at 09:59
-3

You could use the sed command and pass it to Java as an array: sed -i s/0/o/g filename

-i - edits the file in place, so the changes are saved back to the same file

s - the substitute (search and replace) command

0 - character to be searched for

o - character to be inserted

To see how to pass sed as an array in Java, see the linked question "How to run sed command from java code".

Let me know if this works for you