Without removing any characters, split a string at a regex match

Question

I want to split this text on the dates but without removing the dates from the string:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **

The first element in the array would be:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @

Entries have a variable line count, so I can't split on new lines.

The format of the date is:

month_abbreviation + space(or two) + day_number

Something like this pseudocode:

three_letter_word + whitespace(s) + one_or_two_digit_number

would work.
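One way to read that pseudocode, sketched here as an assumption rather than a definitive answer: anchor the pattern at the start of a line and wrap it in a lookahead, so `split` finds the position before each date without consuming any characters. The constant name `DATE_START` is made up for this illustration.

```ruby
# Sample taken from the question; continuation lines are indented.
text = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

# ^ anchors at the start of a line. The lookahead (?=...) matches a position,
# not characters, so split removes nothing from the string.
DATE_START = /^(?=[a-z]{3} +\d{1,2}\b)/i

entries = text.split(DATE_START)
entries.length        # => 2
entries.join == text  # => true
```

Joining the pieces reproduces the input exactly, which is a quick check that no characters were dropped.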

If each element is on a separate line, maybe just split by lines `string.split("\n")`? — Alexey Shein, Sep 26 '15 at 20:36
@alexey-shein Each entry has an unknown amount of lines, so that won't work. I've since edited the question. — maxple, Sep 26 '15 at 20:43
You need to edit to: 1) make your input a valid string (`'sep 25....'`), 2) assign your input to a variable so that readers can reference it without having to define it (`str = 'sep 25...'`) and 3) show the output you want for the given input. — Cary Swoveland, Sep 26 '15 at 22:04
Your pseudocode would match `"For 25"`. You may be assuming that that string could not appear, that any string comprised of three letters, a space, one or two digits, followed by anything, must be a date. But if that's the case, the question really has nothing to do with dates; you just want to search for a particular pattern. The reason you want to do that is irrelevant. Just tell us what you want to do, not how you want to do it. You also need to indicate whether the strings you want to match must be at the beginning of lines, as in your example, or could be anywhere in the text. — Cary Swoveland, Sep 27 '15 at 16:33

the Tin Man · Answer 1 · 2015-09-30T02:16:37.453

Ruby has a wonderful method that's part of Array (inherited from Enumerable) called slice_before. I'd use it like:

str = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]
MONTH_PATTERN = Regexp.union(MONTHS).source # => "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
MONTH_REGEX = /^(?:#{ MONTH_PATTERN })\b/i # => /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i

schedule = str.lines.slice_before(MONTH_REGEX).to_a
# => [["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#      "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"],
#     ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#      "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]]

schedule[0]
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#     "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"]

schedule[1]
# => ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#     "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]

slice_before doesn't work on a string; it works on an Array or Enumerator, so the first step is to split the string at the line ends using lines, which returns an array. slice_before then looks at each element in the array and starts a new sub-array at every element that matches MONTH_REGEX. It returns an enumerator, which is why the code calls to_a on the result.
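A small sketch of the two steps in isolation, using a shortened, made-up schedule (the band and venue names are placeholders, not data from the question):

```ruby
text = "sep 25 fri First Band\n    at Venue One\nsep 26 sat Second Band\n"

# Step 1: String#lines keeps the line endings and returns an Array.
lines = text.lines
lines.class   # => Array

# Step 2: slice_before starts a new group at every element that matches
# the pattern and returns an Enumerator.
groups = lines.slice_before(/^sep\b/i)
groups.class  # => Enumerator

groups.to_a
# => [["sep 25 fri First Band\n", "    at Venue One\n"],
#     ["sep 26 sat Second Band\n"]]
```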

/^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i basically says "starting at the beginning of a line, find words that match the three-letter month names, whatever their letter-case is".

Because a regular expression defines the "slice before" point, it's easy to customize the exact pattern that needs to match. In this particular case, the lines with leading white space are continuation lines; in other words, they are secondary to the line that starts the record. You will see this sort of data output occasionally. The lines without leading white space are the break lines, signifying the start of a new record. I could break using a pattern of /^\S/, which means "find a line that starts with something that is NOT white space", but I felt that matching on something more specific, the month abbreviation, was useful and specific enough without wasting time in the match process. /^\w{3} \d{1,2} \w{3} / would also work but would be overkill, since the ^ already requires the match to occur at the start of the line. If this doesn't make sense, read the Regexp class's documentation and experiment in IRB; it's not difficult to figure out.
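To see that the three candidate patterns behave the same on this particular input, here is a rough comparison. The hash keys are labels invented for this sketch, and `transform_values` assumes Ruby 2.4 or later.

```ruby
text = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

patterns = {
  non_space: /^\S/,                                                   # any unindented line
  month:     /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i,
  full_date: /^\w{3} \d{1,2} \w{3} /                                  # month, day, weekday
}

# Slice the lines with each pattern, then join every group back into a string.
results = patterns.transform_values do |re|
  text.lines.slice_before(re).map(&:join)
end

results.values.uniq.length  # => 1, all three patterns give the same records
results[:month].length      # => 2
```

On other inputs the patterns can diverge: /^\S/ breaks on any unindented line, while the month pattern ignores unindented lines that do not start with a month abbreviation.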

Join the sub-arrays back into strings if you want:

schedule.map(&:join)
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n",
#     "sep 25 fri The Holdup, The Wheeland Brothers\n    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]

This is a technique we use in-house to take apart giant configuration files, by breaking them into lines and finding the markers for sections with the regular expressions.

Splitting on the month abbreviation alone is not very discriminating. For example, `str = "Can Tom come to your party?\nMay Beth come too?"; schedule = str.lines.slice_before(MONTH_REGEX).to_a #=> [["Can Tom come to your party?\n"], ["May Beth come too?"]]`. — Cary Swoveland, Sep 27 '15 at 01:00
It's trivial to incorporate dates and years, but since the remaining lines begin with spaces it's a non-issue. — the Tin Man, Sep 27 '15 at 01:18
Yes, it's trivial, but that may not be apparent to inexperienced Rubyists reading your answer. — Cary Swoveland, Sep 27 '15 at 02:08
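
For readers following this exchange, a stricter break pattern might look like the sketch below. `DATE_LINE` is a name invented here, and the sketch assumes every record starts with month, day and weekday abbreviations separated by spaces, as in the question's sample.

```ruby
# Month abbreviation, day number and weekday abbreviation at the start of a line.
DATE_LINE = /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) +\d{1,2} +(?:mon|tue|wed|thu|fri|sat|sun)\b/i

chat = "Can Tom come to your party?\nMay Beth come too?"
chat_groups = chat.lines.slice_before(DATE_LINE).to_a
# => [["Can Tom come to your party?\n", "May Beth come too?"]]

schedule_text = "sep 25 fri The Holdup\n    at the El Rey Theatre\noct 26 sat Another Show\n"
schedule_groups = schedule_text.lines.slice_before(DATE_LINE).to_a
schedule_groups.length  # => 2
```

With the day number and weekday required, "May Beth come too?" no longer starts a new group.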

Cary Swoveland · Accepted Answer · 2015-10-08T16:42:23.207

You specified that you want to split on dates. I've therefore not split on any string having the specified date format that cannot be converted to a date, including "Sep 31 Sat" and "Sep 26 Wed" (the latter, this year, is "Sat"). I've assumed the date substrings can appear anywhere in the string. If you wish to demand that they begin at the beginning of each line, that's of course an easy modification.

str =
"sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 31 mon at some other place 
oct 26 sat The Holdup, The Wheeland Brothers
       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **"

require 'date'

arr = str.split.
          map(&:capitalize).
          each_cons(3).
          map { |a| a.join(' ') }.
          select { |s| Date.strptime(s, '%b %d %a') rescue nil }
  #=> ["Sep 25 Fri", "Oct 26 Sat"]

r = /(#{ arr.join('|') })/i
  #=> /(Sep 25 Fri|Oct 26 Sat)/i

str.split(r)
  #=> ["",
  #    "sep 25 fri",
  # " The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n\
  #  at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n    sep 31\
  #   mon at some other place \n    ",
  # "oct 26 sat",
  # " The Holdup, The Wheeland Brothers\n           at the El Rey Theatre,\
  #   Chico 18+ (a/a with adult) 7:30pm/8:30pm **"]

To avoid empty strings at the beginning and end of the array returned, use:

str.split(r).delete_if(&:empty?)
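
Because the pattern is wrapped in a capture group, `split` returns each date as its own element. If each date should stay attached to the text that follows it, the pieces can be paired up again. The following is only a sketch on a shortened, made-up string with a simplified pattern, and it assumes the text begins with a date, so the first element is the empty string.

```ruby
text = "sep 25 fri A\n  at X\noct 26 sat B\n"

# The capture group makes split keep the matched dates in the result.
pieces = text.split(/((?:sep|oct) \d{1,2} \w{3})/i)
# => ["", "sep 25 fri", " A\n  at X\n", "oct 26 sat", " B\n"]

# Drop the leading empty string, then glue each date to the text after it.
entries = pieces.drop(1).each_slice(2).map(&:join)
# => ["sep 25 fri A\n  at X\n", "oct 26 sat B\n"]
```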

I'm accepting this answer because it's useful in a lot of situations to specify the `split` boundaries using a block, and it did work for me. — maxple, Sep 30 '15 at 22:54

score 0 · Answer 3 · answered Sep 26 '15 at 21:03

Assuming that the OP's description:

three_letter_word + whitespace(s) + one_or_two_digit_number would work

is correct,

text.split(/(?=\w{3} +\d{1,2})/)

sawa


You need to put `\b` before `\w{3}` because otherwise it will match in the middle of the word. In fact, it splits on `Chico 18` in OP's example. – Alexey Shein Sep 26 '15 at 21:13
@AlexeyShein I know that. As I emphasized, my code only works under the assumption that what OP wrote is correct. Whether that is not the case is the OP's problem. – sawa Sep 26 '15 at 21:58
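
The difference discussed in these comments can be checked directly. This sketch runs both patterns against the question's sample; the variable names `loose` and `strict` are chosen here for illustration.

```ruby
text = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

# Without \b the lookahead also matches inside a word: "ico 18" in "Chico 18".
loose = text.split(/(?=\w{3} +\d{1,2})/)
loose.length   # => 3
loose.last     # starts with "ico 18+"

# With \b the three word characters must begin at a word boundary.
strict = text.split(/(?=\b\w{3} +\d{1,2})/)
strict.length  # => 2
```

Both variants keep every character, since the lookahead consumes nothing; only the split positions differ.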

score 0 · Answer 4 · answered Sep 26 '15 at 21:29

There are 12 months and 7 days, so you could select for them:

text = <<txt
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 The Holdup, The Wheeland Brothers
       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
txt

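# Split on month + day; the outer capture group keeps the date in the result.
text.split(/((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s+[123]?\d)/).each{|part|
  p part
}
p '-------------'
# Same, but also take an optional weekday abbreviation into the date part.
text.split(/((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s+[123]?\d(?:\s*(?:mon|tue|wed|thu|fri|sat|sun))?)/).each{|part|
  p part
}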

The result:

""
"sep 25"
" fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"
"sep 25"
" The Holdup, The Wheeland Brothers\n       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"
"-------------"
""
"sep 25 fri"
" The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"
"sep 25"
" The Holdup, The Wheeland Brothers\n       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"

Some details on the regex:

(?:....) is a non-capturing group: it keeps the matched part from becoming part of the result (as $1, $2...).
Only the complete date match is wrapped in a plain (...) group, so it becomes part of the result.
Without the outermost () the match would be removed from the result.
The regex in my example is case sensitive.
The [123]?\d part matches an optional 1, 2 or 3 followed by another digit. This also allows impossible day numbers like 32, 33...
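
The grouping behaviour is easier to see on a tiny string. The second half of this sketch shows one possible tighter day-number check; the `DAY` pattern is an assumption added here for illustration, not part of the answer, and `match?` assumes Ruby 2.4 or later.

```ruby
sample = "a1b2c"

sample.split(/\d/)      # => ["a", "b", "c"]            no group: delimiters dropped
sample.split(/(\d)/)    # => ["a", "1", "b", "2", "c"]  capture group: delimiters kept
sample.split(/(?:\d)/)  # => ["a", "b", "c"]            non-capturing group: dropped

# A stricter day number: 1-9 (optionally zero-padded), 10-29, 30 or 31.
DAY = /\A(?:0?[1-9]|[12]\d|3[01])\z/
checks = %w[1 09 25 31 32 0].map { |d| d.match?(DAY) }
# => [true, true, true, true, false, false]
```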

score 0 · Answer 5 · answered Sep 26 '15 at 21:29

Since you need to split at each occurrence of a date, you need to know where the regex engine is during the matching process. You can use a lookahead, (?=...), wrapped around the token you want to find in order to achieve this.

Take for instance, this pattern (?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})

Here, the regex engine would be at the starting position of any word with three letters followed by one or more space(s), one or two digits, one or more space(s), and a word with 6 to 9 letters, e.g. sep 25 Friday. In this example, the regex engine is before s in sep. Using this knowledge, you can now split the String using any programming language of your choice.

line.split(/(?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})/)

(?=...): This is a lookahead that matches the position before the enclosed pattern without consuming it.

[a-zA-Z]{3}: matches 3 letters, since month abbreviations are letters and not numbers, e.g. sep

\s+\d{1,2}: matches one or more whitespace characters, followed by one or two digits

\s+[a-zA-Z]{6,9}: matches one or more whitespace characters, followed by at least 6 and at most 9 letters, since the shortest day names have 6 letters (e.g. Friday) and the longest, Wednesday, has 9
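
A rough check of this pattern on a made-up string that spells the weekday out in full. Note that, as written, the pattern needs 6 to 9 letters after the day number, so an abbreviated weekday such as "fri" does not match and the string comes back in one piece.

```ruby
FULL_DAY = /(?=[a-zA-Z]{3}\s+\d{1,2}\s+[a-zA-Z]{6,9})/

line = "sep 25 Friday The Phenomenauts\n   at Jub Jubs\nsep 26 Saturday The Holdup\n"
parts = line.split(FULL_DAY)
# => ["sep 25 Friday The Phenomenauts\n   at Jub Jubs\n",
#     "sep 26 Saturday The Holdup\n"]

# With a three-letter weekday nothing matches, so split returns one element.
short = "sep 25 fri The Holdup\n"
short.split(FULL_DAY)  # => ["sep 25 fri The Holdup\n"]
```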

yes it would, but the op just needed a pattern to do that. He can as well, explicitly put all the months, and days of the week in place of `[a-zA-Z]+` — james jelo4kul, Sep 27 '15 at 12:04

score -2 · Answer 6 · answered Sep 26 '15 at 22:56

I can see that every line except the first line of each record is indented with several spaces, so you can split with str.split(/\n(?!\s+)/).

Aetherus
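
One thing to be aware of with /\n(?!\s+)/ is that the matched newline is consumed, so the pieces no longer add up to the original string. A zero-width variant, offered here as a sketch and not as part of the answer above, splits at the same places while keeping every character. The sample string is made up.

```ruby
str = "a\n  b\nc\n  d\n"

# The newline before each unindented line is the delimiter, so it is removed.
lossy = str.split(/\n(?!\s+)/)
# => ["a\n  b", "c\n  d"]

# Lookbehind plus lookahead: split at the position after a newline and
# before a non-space character, consuming nothing.
lossless = str.split(/(?<=\n)(?=\S)/)
# => ["a\n  b\n", "c\n  d\n"]

lossy.join == str     # => false
lossless.join == str  # => true
```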

