I need to detect four-word passphrases in content (more generally, sequences between n and m words long). ALL sequences of four words have to be detected, even those that partially overlap. That is my problem: I only know how to write a pattern that consumes four words and then moves on to the next sequence of words, starting where the previous one ended.

E.g. if I have the sequence:

random correct horse battery staple bug tin hat

and I use:

([A-Za-z0-9]+ ){4}([A-Za-z0-9]+)

it will only find:

  • random correct horse battery

and

  • staple bug tin hat

But I actually need to find all of the following instead:

  • random correct horse battery

  • correct horse battery staple

  • horse battery staple bug

  • battery staple bug tin

  • staple bug tin hat

So, all four-word sequences in the supplied string.

I understand my problem is that my regex is consuming the first four words when it finds the first match.

Can anyone explain how to write a regular expression that only "consumes" the first word, so that the next match starts at the second word, and so on?

Thanks!

  • 1
    The complexity is going to be here: "even those that are partially overlapping". Your regex is going to get ugly if you try to do it with lookaheads and lookbehinds across multiple overlaps. That said, to help with your problem: can you provide an example text along with your expected results? There might be an elegant way of achieving this. – Steve Tomlin Oct 07 '20 at 16:22
  • 1
    FYI your regex `([A-Za-z0-9]+ ){4}([A-Za-z0-9]+)` captures *5* words, not 4. Change `4` to `3`. – Bohemian Oct 07 '20 at 19:22

3 Answers

You might succeed in resolving the multiple overlaps with lookaheads and lookbehinds, but even if you succeed I believe the expression is going to be messy. Here is a link about regex lookahead and lookbehind:

Regex lookahead, lookbehind and atomic groups

This might help. It does not rely on regex alone: it combines a "sliding window" over the words with a regex check of each four-word window:

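import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FourWordWindows {
    public static void main(String[] args) {
        String input = "random correct horse battery staple bug tin hat";
        String[] arr = input.split("\\s+");

        // A window qualifies if it holds four alphanumeric words, each followed by whitespace.
        Pattern pattern = Pattern.compile("([A-Za-z0-9]+\\s){4}");

        // Slide a four-word window over the array, advancing one word at a time.
        for (int i = 0; i <= arr.length - 4; i++) {
            String fourWords = String.format("%s %s %s %s ", arr[i], arr[i + 1], arr[i + 2], arr[i + 3]);
            Matcher matcher = pattern.matcher(fourWords);

            if (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }
}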

Output:

random correct horse battery
correct horse battery staple
horse battery staple bug
battery staple bug tin
staple bug tin hat 
DigitShifter

This can't be done with a plain regex match alone, because the input returned by each match is consumed.

Split the string and work with the tokens, e.g.:

import java.util.*;

List<String> words = Arrays.asList(sentence.split(" "));
List<List<String>> fourGrams = new ArrayList<>();
// Use <= so that the final four-word window is included.
for (int i = 0; i <= words.size() - 4; i++) {
    fourGrams.add(words.subList(i, i + 4));
}
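
Since the question also mentions sequences between n and m words long, here is a self-contained sketch of the same split-and-slide idea with the window size as a parameter. The class name `WordWindows` and the helper `windows` are hypothetical names introduced for illustration, and splitting on runs of whitespace is an assumption about the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordWindows {
    // Hypothetical helper: returns every run of `size` consecutive words,
    // including runs that overlap each other.
    static List<String> windows(String text, int size) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // i + size <= words.length keeps the last full window in range.
        for (int i = 0; i + size <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "random correct horse battery staple bug tin hat";
        for (String window : windows(input, 4)) {
            System.out.println(window);
        }
    }
}
```

To cover a range of lengths, `windows` could be called once for each size from n to m and the results collected.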
Bohemian

As pointed out in the comments, to match 4 words the quantifier has to be 3 instead of 4: three repetitions of "word plus space" followed by one final word make a total of 4.

As you are matching the characters [A-Za-z0-9], you can start the match with a word boundary \b.

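Then (if supported) use a positive lookahead that captures the 4 words in a single capturing group. Because the lookahead itself consumes no characters, the next match attempt can start at the next word, so overlapping sequences are found.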

\b(?=((?:[A-Za-z0-9]+ ){3}[A-Za-z0-9]+\b))
  • \b A word boundary
  • (?= Positive lookahead, assert directly to the right is
    • ( Capture group 1
      • (?:[A-Za-z0-9]+ ){3} Repeat 3 times matching 1+ times the character class followed by a space
      • [A-Za-z0-9]+\b Match 1+ times any of the listed followed by a word boundary
    • ) Close group 1
  • ) Close lookahead

Regex demo

Note that, unlike in the pattern that you tried, the quantifier repeats the non-capturing group (?:[A-Za-z0-9]+ ){3}, because repeating a capturing group only keeps the capture of the last iteration.
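
The last-iteration behaviour can be checked directly. Below is a minimal sketch in Java (an assumption, chosen only because the other answers use it; `RepeatedGroupDemo` and `firstGroup` are hypothetical names) that compares a repeated capturing group with a non-capturing repetition wrapped in a single capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "random correct horse battery";

        // Repeated capturing group: group 1 keeps only the last repetition,
        // so this prints "horse " (with the trailing space).
        System.out.println(firstGroup("([A-Za-z0-9]+ ){3}[A-Za-z0-9]+", input));

        // Non-capturing repetition inside one capturing group: group 1 holds
        // all four words, so this prints "random correct horse battery".
        System.out.println(firstGroup("((?:[A-Za-z0-9]+ ){3}[A-Za-z0-9]+)", input));
    }
}
```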

No language is tagged, but for example in JavaScript:

const regex = /\b(?=((?:[A-Za-z0-9]+ ){3}[A-Za-z0-9]+\b))/g;
const str = `random correct horse battery staple bug tin hat`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  console.log(m[1]);
}
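
Since the other answers use Java, the same lookahead pattern might be applied there roughly as follows. This is a sketch under that assumption, with `OverlappingFourWords` and `find` as hypothetical names. Java's `Matcher.find()` steps past a zero-width match by itself, so the manual `lastIndex` adjustment from the JavaScript loop is not needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingFourWords {
    // Same pattern as above; backslashes are doubled for the Java string literal.
    private static final Pattern FOUR_WORDS =
            Pattern.compile("\\b(?=((?:[A-Za-z0-9]+ ){3}[A-Za-z0-9]+\\b))");

    // Hypothetical helper: collects group 1 of every zero-width match.
    static List<String> find(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = FOUR_WORDS.matcher(text);
        while (m.find()) {
            // The match itself is empty; the four words are in group 1.
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : find("random correct horse battery staple bug tin hat")) {
            System.out.println(s);
        }
    }
}
```

For the sample sentence this prints the five overlapping four-word sequences listed in the question.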
The fourth bird