I need to detect passphrases in content, which are sequences between n and m words long (four words in this example). ALL sequences of four words have to be detected, even those that partially overlap. That is my problem: I only know how to write a pattern that consumes four words and then moves on to the next sequence of words starting at the end of that one.
E.g. if I have the sequence:
random correct horse battery staple bug tin hat
and I use:
([A-Za-z0-9]+ ){3}([A-Za-z0-9]+)
it will only find:
- random correct horse battery
and
- staple bug tin hat
But I actually need to find all of the following instead:
random correct horse battery
correct horse battery staple
horse battery staple bug
battery staple bug tin
staple bug tin hat
In other words, every four-word sequence in the supplied string.
I understand my problem is that my regex is consuming the first four words when it finds the first match.
Can anyone explain how to write a regular expression that only "consumes" the first word and then gives me the next valid sequence starting at the second word, and so on?
Thanks!
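For reference, one approach that seems to fit is to wrap the whole four-word pattern in a capturing group inside a zero-width lookahead, so that each match consumes nothing and the engine retries at the next position. The sketch below is in Python and is only an illustration: the function name `four_word_sequences` is made up, and it assumes words are separated by single spaces, as in the example string. The negative lookbehind is there so a match can only begin at the start of a word, not in the middle of one.

```python
import re

# Hypothetical helper, not from any library: returns every overlapping
# run of four space-separated alphanumeric words.
# (?<![A-Za-z0-9])  -> only start where the previous char is not part of a word
# (?=( ... ))       -> lookahead consumes nothing; the inner group captures the run
OVERLAPPING_FOUR_WORDS = re.compile(
    r"(?<![A-Za-z0-9])(?=((?:[A-Za-z0-9]+ ){3}[A-Za-z0-9]+))"
)


def four_word_sequences(text):
    # With exactly one capturing group, findall returns the captured strings.
    return OVERLAPPING_FOUR_WORDS.findall(text)


if __name__ == "__main__":
    sample = "random correct horse battery staple bug tin hat"
    for sequence in four_word_sequences(sample):
        print(sequence)
```

Because the overall match is empty, the engine advances one character at a time, and the lookbehind rejects every position that is not the start of a word. For sequences of a different length, the `{3}` quantifier would be changed to one less than the desired word count.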