1

I want to find the repeated words in a given String. I would like a regular expression that finds every occurrence of a word that appears more than once. For example, given "I want to eat apple. apple is a fruit",

the regular expression should find the word "apple".

kasravnd
Abhijit Bashetti
  • http://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words – Veselin Davidov Apr 29 '15 at 11:22
  • So have you tried anything to figure your problem out? – kasravnd Apr 29 '15 at 11:22
  • Why do you want to use regex? – Saif Apr 29 '15 at 11:22
  • Regex is not the right way to do this. Use `String#split()` and then add the strings to a `Set`. – TheLostMind Apr 29 '15 at 11:23
  • @TheLostMind: I already tried this in Java code. FYI, please have a look at http://codereview.stackexchange.com/questions/88234/removing-and-counting-repeated-strings ... But now I am trying a regular expression. – Abhijit Bashetti Apr 29 '15 at 11:25
  • @AbhijitBashetti - Regex is not the right way of approaching this problem. Regex was not designed for cases like this. You might come up with a regex, but it will most probably break somewhere. – TheLostMind Apr 29 '15 at 11:28
  • @VeselinDavidov: I had a look at it, but it does not give the result; it says no match. – Abhijit Bashetti Apr 29 '15 at 11:29
  • To be clear, solving this with a regex is only possible in theory; in practice the complexity grows too quickly with the string length. So a possible approach consists of storing, in a data structure, the positions in the string of each distinct word. – Casimir et Hippolyte Apr 29 '15 at 11:32
  • @CasimiretHippolyte: Yes, I agree with you, it would be complex. I have tried it with code and thought of giving regex a try. – Abhijit Bashetti Apr 29 '15 at 11:39
  • @Saif: I have tried it with code and wanted to try it with regex. – Abhijit Bashetti Apr 29 '15 at 12:05
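The non-regex approach suggested in the comments above (collect the words, remember where each one occurs) could be sketched roughly as follows. This is only an illustration, not code from any commenter: the class name `RepeatedWordPositions` and the method `find` are made up, and it assumes that a "word" is a run of `\w` characters. The pattern is used only to tokenize; the map does the duplicate detection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordPositions {
    // Hypothetical helper: maps every word that occurs more than once
    // to the list of its start offsets in the text.
    static Map<String, List<Integer>> find(String text) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        // Assumption: a word is a run of \w characters.
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            positions.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
        }
        // Drop the words that were seen only once.
        positions.values().removeIf(starts -> starts.size() < 2);
        return positions;
    }

    public static void main(String[] args) {
        // prints {apple=[14, 21]}
        System.out.println(find("I want to eat apple. apple is a fruit"));
    }
}
```

This makes a single pass over the text, so the cost grows linearly with the input instead of depending on how far a lookahead has to scan.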

3 Answers

1

This works for multiple repetitions and across multiple lines:

    Pattern p = Pattern.compile("\\b(\\w+)\\b(?=.*\\b(\\1)\\b)", Pattern.DOTALL);

    String s = "I want to eat apple. apple is a fruit.\r\n I really want fruit.";
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println("at: " + m.start(1) + " " + m.group(1));
        System.out.println("    " + m.start(2) + " " + m.group(2));
    }

It outputs:

    at: 0 I
        41 I
    at: 2 want
        50 want
    at: 14 apple
        21 apple
    at: 32 fruit
        55 fruit
Daniel Sperry
  • Your regex will consume text between duplicates which will prevent reusing it to find other duplicates like `eat apple eat apple` – Pshemo Apr 29 '15 at 11:33
  • It will not be much more complex. All you need to do is prevent consuming which can be done by look-ahead mechanism. – Pshemo Apr 29 '15 at 11:37
  • @Pshemo: even with a lookahead, this kind of pattern will crash (or take too much time) with a few lines of text. – Casimir et Hippolyte Apr 29 '15 at 11:40
  • @CasimiretHippolyte Yes, I also don't like regex here (because of its inefficiency, such as the greediness in this one, and pitfalls like `.` not matching line separators by default), but since OP is asking for a regex, let him have it (everyone has the right to make their own mistakes). I hope that OP will test this approach and figure out that the upvoted comments were upvoted for a reason. – Pshemo Apr 29 '15 at 11:46
  • @Pshemo: I appreciate your support. I have already tried this by writing a piece of Java code and thought of giving regex a try. As I am not a regex expert, I am asking for help. Here is my code: http://codereview.stackexchange.com/questions/88234/removing-and-counting-repeated-strings – Abhijit Bashetti Apr 29 '15 at 11:51
  • @DanielSperry "*@karthik manchala got it perfectly*" actually it is very far from perfect. That solution will not support multiple lines, and in look-ahead it will try to traverse till end of line, even if it will find correct match. – Pshemo Apr 29 '15 at 11:55
  • @Pshemo: I don't know if it is perfect, but I updated my answer to find duplicates across multiple lines :) – karthik manchala Apr 29 '15 at 12:09
  • @Pshemo, thanks for your comments I think the answer looks a lot better now. And karthik manchala, good call with the lookahead! – Daniel Sperry Apr 29 '15 at 12:17
  • Yes it does, but please make `.*` reluctant instead of greedy by adding `?` right after it. Don't make regex search for *last* repetition of word. – Pshemo Apr 29 '15 at 12:18
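The last comment's suggestion (a reluctant `.*?` in the lookahead) might look like the sketch below. It is an illustration only: the class name `NearestRepeat` and the helper `describe` are invented here, and the pattern is the one from the answer with a single `?` added. With the reluctant quantifier the second group should capture the next later occurrence of the word rather than the last one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NearestRepeat {
    // The answer's pattern with ".*" changed to the reluctant ".*?",
    // so the lookahead stops at the first later occurrence of the word.
    static final Pattern P =
            Pattern.compile("\\b(\\w+)\\b(?=.*?\\b(\\1)\\b)", Pattern.DOTALL);

    // Hypothetical helper: one "word@start->startOfNextOccurrence" line per match.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = P.matcher(text);
        while (m.find()) {
            sb.append(m.group(1)).append('@').append(m.start(1))
              .append("->").append(m.start(2)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "eat" at 0 is paired with the "eat" at 10 (not the last one at 20).
        System.out.print(describe("eat apple eat apple eat"));
    }
}
```

The set of words that match is the same as with the greedy version; only the position reported for the second group and the amount of scanning per match change.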
1

You can use the following to match all the duplicate words in a line.

    (\\b\\w+\\b)(?=.*\\b\\1\\b)        // matches duplicates only in a single line

Edit: If you want to match duplicates in multiple lines you can use:

    (\\b\\w+\\b)(?=[\\s\\S]*\\b\\1\\b)  // or the above regex with DOTALL flag

See demo for single line and demo for multiple lines
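To see the difference between the two patterns in Java, a small sketch like the following may help. The class name `DuplicateWords` and the helper `duplicates` are invented for illustration; the two pattern strings are copied from above. It collects each duplicated word once, in order of first appearance.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWords {
    // Patterns copied from the answer; only the constant names are made up.
    static final Pattern SINGLE_LINE = Pattern.compile("(\\b\\w+\\b)(?=.*\\b\\1\\b)");
    static final Pattern MULTI_LINE = Pattern.compile("(\\b\\w+\\b)(?=[\\s\\S]*\\b\\1\\b)");

    // Hypothetical helper: the distinct words that the pattern reports as repeated.
    static Set<String> duplicates(Pattern p, String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "apple is a fruit.\nI like fruit and apple.";
        // "." does not cross the line break by default, so nothing repeats within one line.
        System.out.println(duplicates(SINGLE_LINE, text)); // prints []
        // [\s\S] also matches line terminators, so repeats on later lines are found.
        System.out.println(duplicates(MULTI_LINE, text));  // prints [apple, fruit]
    }
}
```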

karthik manchala
0

This approach splits the text on every run of characters that are neither letters nor digits (punctuation and whitespace) and groups the resulting words into a Map.

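    // Split on runs of non-letter, non-digit characters, then group equal words together.
    Map<String, List<String>> groups =
        Stream.of("I? want.... to eat apple    eat apple.      apple, is! a fruit".split("[^\\p{L}\\p{N}]+"))
              .collect(Collectors.groupingBy(s -> s));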

Result:

    a=[a], apple=[apple, apple, apple], fruit=[fruit], want=[want], eat=[eat, eat], I=[I], is=[is], to=[to]
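If only the repeated words are wanted, the grouping can be narrowed down to counts and then filtered. The following is a sketch under assumptions, not part of the original answer: the class name `RepeatCounts` and the method `repeated` are made up, and a `TreeMap` is used only so that the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RepeatCounts {
    // Hypothetical helper: counts each word and keeps only those seen more than once.
    static Map<String, Long> repeated(String text) {
        Map<String, Long> counts =
                Stream.of(text.split("[^\\p{L}\\p{N}]+"))
                      // split() yields an empty first token if the text starts with a separator
                      .filter(w -> !w.isEmpty())
                      .collect(Collectors.groupingBy(
                              Function.identity(), TreeMap::new, Collectors.counting()));
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) {
        // prints {apple=3, eat=2}
        System.out.println(repeated(
                "I? want.... to eat apple    eat apple.      apple, is! a fruit"));
    }
}
```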
Steve Chaloner