How to match bar, b-a-r, b--a--r etc in a string by Regexp

Question

Given a string, I want to find the word bar, b-a-r, b--a--r, etc., where - can be any letter, but the interval between the letters must be the same.

All letters are lower case and there are no spaces between them.

For example, bar, beayr, qbowarprr and wbxxxxxayyyyyrzzz should all match.

I tried /b[a-z]*a[a-z]*r/, but this also matches bxar, which is wrong.

Can I achieve this with a regexp?
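
A minimal Ruby sketch of why the attempted pattern over-matches: each `[a-z]*` can take a different length, so nothing forces the two gaps to be equal. The variable names are only illustrative.

```ruby
# The attempted pattern: each gap may have any length, independently.
loose = /b[a-z]*a[a-z]*r/

loose_bxar  = loose.match?('bxar')   # true: one letter before "a", none after
loose_beayr = loose.match?('beayr')  # true: the gaps happen to be equal

# Fixing the gap width on both sides removes the false positive,
# but only covers one gap size (here, exactly one letter).
one_spacer = /b[a-z]{1}a[a-z]{1}r/

strict_bxar  = one_spacer.match?('bxar')   # false
strict_beayr = one_spacer.match?('beayr')  # true
```

A single fixed-width pattern handles only one gap size, so covering every gap size needs one pattern per width.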

What language are you using? (Perl 5.18 has extended character classes where you can include, for example, alpha, and exclude specific chars.) — DavidO, May 03 '14 at 06:44
You will spend less time on the problem by breaking it into smaller subproblems and solving each of them incrementally, rather than figuring out how to solve it all at once with a single regex. (edited) — DavidO, May 03 '14 at 06:50
@dystroy, are you serious about this being off-topic because the tags are not all lined up straight? Why not just suggest that shin add the tag? — Cary Swoveland, May 03 '14 at 18:57
@CarySwoveland Because at least two thirds of regex questions are pure garbage and aren't answerable mainly because we have no information regarding the language/library. Now that the tag has been added (which is rare), my comment is useless. — Denys Séguret, May 03 '14 at 19:02

Cary Swoveland · Accepted Answer · 2014-05-03T22:37:39.203

Here is one way to get all matches.

Code

def all_matches_with_spacers(word, str)
  word_size = word.size
  word_arr = word.chars
  str_arr  = str.chars
  (0..(str.size - word_size)/(word_size-1)).each_with_object([]) do |n, arr|
    regex = Regexp.new(word_arr.join(".{#{n}}"))
    str_arr.each_cons(word_size + n * (word_size - 1))
           .map(&:join)
           .each { |substring| arr << substring if substring =~ regex }
  end
end

This requires word.size > 1.
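
The restriction comes from the divisor word_size - 1 in the upper bound of the range. A short sketch of what happens with a one-letter word (the sample values are assumptions for illustration):

```ruby
word = 'b'
str  = 'abc'

message = nil
begin
  # Same expression as the upper bound of the range in the method.
  (str.size - word.size) / (word.size - 1)
rescue ZeroDivisionError => e
  # Integer division by zero raises rather than returning Infinity.
  message = e.message
end

message # "divided by 0"
```

A caller that may pass one-letter words could check word.size before calling the method.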

Example

all_matches_with_spacers('bar',  'bar')               #=> ["bar"]
all_matches_with_spacers('bar',  'beayr')             #=> ["beayr"]
all_matches_with_spacers('bar',  'qbowarprr')         #=> ["bowarpr"]
all_matches_with_spacers('bar',  'wbxxxxxayyyyyrzzz') #=> ["bxxxxxayyyyyr"]

all_matches_with_spacers('bobo', 'bobobocbcbocbcobcodbddoddbddobddoddbddob')
  #=> ["bobo", "bobo", "bddoddbddo", "bddoddbddo"]

Explanation

Suppose

word = 'bobo'
str =  'bobobocbcbocbcobcodbddoddbddobddoddbddob'

then

word_size = word.size  #=> 4
word_arr  = word.chars #=> ["b", "o", "b", "o"]
str_arr = str.chars
  #=> ["b", "o", "b", "o", "b", "o", "c", "b", "c", "b", "o", "c", "b", "c",
  #    "o", "b", "c", "o", "d", "b", "d", "d", "o", "d", "d", "b", "d", "d",
  #    "o", "b", "d", "d", "o", "d", "d", "b", "d", "d", "o", "b"]

If n is the number of spacers between each letter of word, we require

word.size + n * (word.size - 1) <= str.size

Hence (since str.size is 40),

n <= (str.size - word_size)/(word_size-1) #=> (40-4)/(4-1) => 12
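
The bound relies on Ruby's integer division, which rounds toward negative infinity. A small sketch with a hypothetical helper, max_spacers, that is not part of the answer's method:

```ruby
# Hypothetical helper: largest n with word_size + n * (word_size - 1) <= str_size.
def max_spacers(word_size, str_size)
  (str_size - word_size) / (word_size - 1)
end

max_spacers(4, 40) # => 12, the value used in the walkthrough
max_spacers(3, 9)  # => 3, since 3 + 3 * 2 = 9 fits exactly
max_spacers(3, 8)  # => 2, since 3 + 3 * 2 = 9 would not fit
max_spacers(3, 2)  # => -1, so (0..-1) is empty and nothing is searched
```

The last case shows that a string shorter than the word needs no special handling: the range is empty and the method returns an empty array.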

We therefore will iterate over zero to 12 spacers:

(0..12).each_with_object([]) do |n, arr| .. end

Enumerable#each_with_object passes the initially empty array given as its argument to the block as the block variable arr, and returns that array when the iteration finishes. The first value passed to the block is zero (spacers), assigned to the block variable n.
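
As a stand-alone illustration of each_with_object (the squares example is an assumption chosen for brevity, unrelated to the matching problem):

```ruby
# The second block variable is the object passed as the argument;
# the same object is returned once the iteration is done.
squares = (0..3).each_with_object([]) { |n, arr| arr << n * n }

squares # => [0, 1, 4, 9]
```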

We then have

regex = Regexp.new(word_arr.join(".{#{0}}")) #=> /b.{0}o.{0}b.{0}o/

which is the same as /bobo/. With n spacers, word has length

word_size + n * (word_size - 1) #=> 4 (for n = 0)

To extract all sub-arrays of str_arr with this length, we invoke:

str_arr.each_cons(word_size + n * (word_size - 1))

Here, with n = 0, this is:

enum = str_arr.each_cons(4)
  #=> #<Enumerator: ["b", "o", "b", "o", "b", "o",...,"b"]:each_cons(4)>

This enumerator will pass the following into its block:

enum.to_a
  #=> [["b", "o", "b", "o"], ["o", "b", "o", "b"], ["b", "o", "b", "o"],
  #    ["o", "b", "o", "c"], ["b", "o", "c", "b"], ["o", "c", "b", "c"],
  #    ["c", "b", "c", "b"], ["b", "c", "b", "o"], ["c", "b", "o", "c"],
  #    ["b", "o", "c", "b"], ["o", "c", "b", "c"], ["c", "b", "c", "o"],
  #    ["b", "c", "o", "b"], ["c", "o", "b", "c"], ["o", "b", "c", "o"],
  #    ...
  #    ["d", "d", "o", "b"]]

We next convert these to strings:

ar = enum.map(&:join)
  #=> ["bobo", "obob", "bobo", "oboc", "bocb", "ocbc", "cbcb", "bcbo",
  #    "cboc", "bocb", "ocbc", "cbco", "bcob", "cobc", "obco",
  #    ...
  #    "ddob"]
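
A compact sketch of each_cons followed by map(&:join), first on a short string (an assumption for illustration) and then counting the windows for the str used in this explanation:

```ruby
# Sliding windows of 3 characters over a 5-character string.
windows = 'abcde'.chars.each_cons(3).map(&:join)
windows # => ["abc", "bcd", "cde"]

# A string of length s has s - k + 1 windows of length k.
str = 'bobobocbcbocbcobcodbddoddbddobddoddbddob'
window_count = str.chars.each_cons(4).count
window_count # => 37, i.e. 40 - 4 + 1
```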

and append to the array arr each string (assigned to the block variable substring) for which substring =~ regex is truthy:

ar.each { |substring| arr << substring if substring =~ regex }

arr #=> ["bobo", "bobo"]

Next we increment the number of spacers to n = 1. This has the following effect:

regex = Regexp.new(word_arr.join(".{#{1}}")) #=> /b.{1}o.{1}b.{1}o/
str_arr.each_cons(4 + 1 * (4 - 1))           #=> str_arr.each_cons(7)

so we now examine the strings

ar = str_arr.each_cons(7).map(&:join)
  #=> ["boboboc", "obobocb", "bobocbc", "obocbcb", "bocbcbo", "ocbcboc",
  #    "cbcbocb", "bcbocbc", "cbocbco", "bocbcob", "ocbcobc", "cbcobco",
  #    "bcobcod", "cobcodb", "obcodbd", "bcodbdd", "codbddo", "odbddod",
  #    "dbddodd", "bddoddb", "ddoddbd", "doddbdd", "oddbddo", "ddbddob",
  #    "dbddobd", "bddobdd", "ddobddo", "dobddod", "obddodd", "bddoddb",
  #    "ddoddbd", "doddbdd", "oddbddo", "ddbddob"]

ar.each { |substring| arr << substring if substring =~ regex }

There are no matches with one spacer, so arr remains unchanged:

arr #=> ["bobo", "bobo"]

For n = 2 spacers:

regex = Regexp.new(word_arr.join(".{#{2}}")) #=> /b.{2}o.{2}b.{2}o/
str_arr.each_cons(4 + 2 * (4 - 1))           #=> str_arr.each_cons(10)
ar = str_arr.each_cons(10).map(&:join)
  #=> ["bobobocbcb", "obobocbcbo", "bobocbcboc", "obocbcbocb", "bocbcbocbc",
  #    "ocbcbocbco", "cbcbocbcob", "bcbocbcobc", "cbocbcobco", "bocbcobcod",
  #    ...
  #    "ddoddbddob"]

ar.each { |substring| arr << substring if substring =~ regex }

arr #=> ["bobo", "bobo", "bddoddbddo", "bddoddbddo"]

No matches are found for more than two spacers, so the method returns

["bobo", "bobo", "bddoddbddo", "bddoddbddo"]
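
For comparison, a sketch of a different way to collect the same matches: instead of slicing the string with each_cons, scan it with a zero-width lookahead so that overlapping matches are found. The method name spaced_matches is hypothetical, and this is an alternative technique, not the method described above.

```ruby
# Hypothetical alternative: one lookahead scan per gap width n.
def spaced_matches(word, str)
  return [] if word.size < 2 || str.size < word.size

  max_n   = (str.size - word.size) / (word.size - 1)
  letters = word.chars.map { |c| Regexp.escape(c) }

  (0..max_n).flat_map do |n|
    body = letters.join(".{#{n}}")
    # The lookahead consumes nothing, so matches may overlap;
    # the capture group holds the text of each match.
    str.scan(Regexp.new("(?=(#{body}))")).flatten
  end
end

str = 'bobobocbcbocbcobcodbddoddbddobddoddbddob'
spaced_matches('bobo', str)
  #=> ["bobo", "bobo", "bddoddbddo", "bddoddbddo"]
spaced_matches('bar', 'qbowarprr')
  #=> ["bowarpr"]
```

Escaping each letter with Regexp.escape keeps the sketch safe for words that contain regex metacharacters.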

I realized that I need to change the word from bar to others. So this solution worked very well. Thanks. Once the explanation is up, I will read it. — shin, May 03 '14 at 21:31

score 0 · Answer 2 · edited May 23 '17 at 12:11

For reference, there is a beautiful solution to the overall problem that is available in regex flavors that allow a capturing group to refer to itself:

^[^b]*bar|b(?:[^a](?=[^a]*a(\1?+.)))+a\1r

Sadly, Ruby doesn't allow this.

The interesting bit is on the right side of the alternation. After matching the initial b, we define a non-capturing group for the characters between b and a. This group will be repeated with the +. Between the a and r, we will inject capture group 1 with \1. This group was captured one character at a time, overwriting itself with each pass, as each character between b and a was added.

See Quantifier Capture where the solution was demonstrated by @CasimiretHippolyte, who refers to the idea behind the technique as the "qtax trick".
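
In Ruby, a rough substitute is to build one regex as an alternation of fixed-gap patterns. This is a sketch of a different technique, not a translation of the self-referencing group above; the method name spaced_word_regex and the max_gap parameter are assumptions.

```ruby
# Hypothetical helper: union of /b.{n}a.{n}r/ for n = 0..max_gap.
def spaced_word_regex(word, max_gap)
  letters = word.chars.map { |c| Regexp.escape(c) }
  Regexp.union((0..max_gap).map { |n| Regexp.new(letters.join(".{#{n}}")) })
end

re = spaced_word_regex('bar', 5)

'wbxxxxxayyyyyrzzz'[re] # => "bxxxxxayyyyyr"
'qbowarprr'[re]         # => "bowarpr"
'bxar'[re]              # => nil
```

Unlike the self-referencing pattern, this needs an upper limit on the gap width chosen in advance.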
