Ruby has a wonderful method called slice_before. It is defined in Enumerable, so it is available on Array. I'd use it like this:
str = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
 at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
 at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT
MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]
MONTH_PATTERN = Regexp.union(MONTHS).source # => "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
MONTH_REGEX = /^(?:#{ MONTH_PATTERN })\b/i # => /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i
schedule = str.lines.slice_before(MONTH_REGEX).to_a
# => [["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
# " at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"],
# ["sep 25 fri The Holdup, The Wheeland Brothers\n",
# " at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]]
schedule[0]
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
# " at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"]
schedule[1]
# => ["sep 25 fri The Holdup, The Wheeland Brothers\n",
# " at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]
slice_before doesn't work on a string; it works on anything Enumerable, such as an Array or Enumerator. So the first step is to split the string at the line ends using lines, which returns an array of lines (in Ruby 1.9 it returned an enumerator). slice_before then looks at each element in turn and starts a new sub-array at every element that matches MONTH_REGEX. It returns an enumerator, which is why the example calls to_a on the result.
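To check the return types yourself, here is a minimal sketch. The sample text and the constant names are made up for illustration, and it assumes Ruby 2.0 or later, where lines returns an Array:

```ruby
# Made-up sample data, not the schedule from the example above.
SAMPLE = "alpha 1\n  more alpha\nbeta 2\n"

SAMPLE_LINES = SAMPLE.lines                 # Array of lines, newlines kept
SLICED = SAMPLE_LINES.slice_before(/^\S/)   # Enumerator; call to_a to get the groups

p SAMPLE_LINES.class  # Array
p SLICED.class        # Enumerator
p SLICED.to_a
# => [["alpha 1\n", "  more alpha\n"], ["beta 2\n"]]

# slice_before also accepts a block instead of a pattern.
BY_BLOCK = SAMPLE_LINES.slice_before { |line| line.start_with?("beta") }.to_a
p BY_BLOCK == SLICED.to_a  # true
```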
/^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i basically says "starting at the beginning of a line, find a whole word that matches one of the three-letter month names, whatever its letter-case is".
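A quick way to convince yourself of what the pattern accepts is to try it against a few strings. This is only a sketch: DEMO_REGEX is the same pattern written out literally so the snippet runs on its own, the test strings are invented, and match? needs Ruby 2.4 or later:

```ruby
# Same pattern as MONTH_REGEX, written literally so this runs standalone.
DEMO_REGEX = /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i

p DEMO_REGEX.match?("sep 25 fri The Holdup")   # true
p DEMO_REGEX.match?("SEP 25 fri The Holdup")   # true, /i ignores letter-case
p DEMO_REGEX.match?(" at the El Rey Theatre")  # false, no month at the line start
p DEMO_REGEX.match?("september 25")            # false, \b wants the word to end after "sep"
p DEMO_REGEX.match?("note\nsep 25")            # true, ^ matches at the start of any line
```

The last case is worth remembering: in Ruby, ^ anchors at the start of every line, not just the start of the string. Here it makes no difference because each element coming out of lines is a single line.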
Because a regular expression is used to match the "slice before" point, it's really easy to customize the exact pattern that needs to match. In this particular case, the lines with leading white space are continuation lines; in other words, they are secondary to the line that precedes them. You will see this sort of data output occasionally. The lines without leading white space are the break lines, signifying the start of a new record. I could break using a pattern of /^\S/, which means "find a line that starts with something that is NOT white space", but I felt that matching on something more specific, the month abbreviation, was useful and specific enough without wasting time in the match process. /^\w{3} \d{1,2} \w{3} / would also work but would be overkill: because of the ^, the match MUST occur at the start of the line, so the month abbreviation alone is enough to identify a new record. If this doesn't make sense, read the Regexp class's documentation and experiment in IRB; it's not at all difficult to figure out.
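As a rough comparison, the sketch below runs the three candidate patterns over the same input and confirms they produce identical groups. The band and venue names are invented, and transform_values assumes Ruby 2.4 or later:

```ruby
# Invented sample in the same layout: record line, then an indented continuation line.
RECORDS = <<EOT
sep 25 fri Band One
 at Some Venue 8pm
oct 3 sat Band Two
 at Another Venue 9pm
EOT

PATTERNS = {
  month:     /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i,
  non_space: /^\S/,
  full_date: /^\w{3} \d{1,2} \w{3} /
}

# Slice the same lines with each pattern.
RESULTS = PATTERNS.transform_values { |re| RECORDS.lines.slice_before(re).to_a }

p RESULTS[:month]
# => [["sep 25 fri Band One\n", " at Some Venue 8pm\n"],
#     ["oct 3 sat Band Two\n", " at Another Venue 9pm\n"]]
p RESULTS.values.uniq.size  # 1, all three patterns agree on this input
```

They only agree because this input is clean. If the data had an unindented line that was not a record, such as a heading, /^\S/ would start a new slice there while the month pattern would not.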
You can join the sub-arrays back into strings if you want:
schedule.map(&:join)
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n",
# "sep 25 fri The Holdup, The Wheeland Brothers\n at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]
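If you would rather have each record on a single line, one option is to strip every line before joining. This is just a sketch with made-up data shaped like the sub-arrays above; GROUPS and ONE_LINERS are names chosen for the example:

```ruby
# Made-up groups in the same shape slice_before produces.
GROUPS = [
  ["sep 25 fri Band One\n", " at Some Venue 8pm\n"],
  ["oct 3 sat Band Two\n", " at Another Venue 9pm\n"]
]

# Strip the newline and leading white space from each line, then join with a space.
ONE_LINERS = GROUPS.map { |group| group.map(&:strip).join(" ") }

p ONE_LINERS
# => ["sep 25 fri Band One at Some Venue 8pm", "oct 3 sat Band Two at Another Venue 9pm"]
```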
This is a technique we use in-house to take apart giant configuration files: we break them into lines and find the section markers with regular expressions.
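To give an idea of what that looks like, here is a small sketch. The configuration format is an assumption for illustration (INI-style [section] headers), not the in-house format, and the host names are placeholders:

```ruby
# Hypothetical INI-style configuration; the real files may look nothing like this.
CONFIG = <<EOT
[database]
host = db.example.com
port = 5432
[cache]
host = cache.example.com
EOT

# A section starts at any line that begins with "[name]".
SECTION_REGEX = /^\[.+\]/

# Slice before each header, then map the header's name to the lines that follow it.
SECTIONS = CONFIG.lines.map(&:chomp).slice_before(SECTION_REGEX).map do |header, *body|
  [header[/\[(.+)\]/, 1], body]
end.to_h

p SECTIONS.keys         # ["database", "cache"]
p SECTIONS["database"]  # ["host = db.example.com", "port = 5432"]
p SECTIONS["cache"]     # ["host = cache.example.com"]
```

This assumes the file starts with a section header. Any lines before the first header would end up in a leading slice with no header of its own and would need separate handling.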