I am trying to write a script that goes through a small text file line by line and builds a list of tuples. Each tuple should have a date as its first entry and, as its second entry, all text and whitespace up to the next date. This should repeat until the file is exhausted.

If this is the contents of the text file:

2018-01-01

Stuff.

More Stuff.

2018-01-02

Different stuff.

Then the output should be:

[("2018-01-01", "Stuff.\n\nMore Stuff."), ("2018-01-02", "Different stuff.")]

I am using Pythex. My regex is: (\d{4}-\d{2}-\d{2})(.|\n|\r)*?

It matches all of the dates, but it never matches the empty lines or text.

I have the MULTILINE flag set in Pythex.

I have also tried setting DOTALL but it still does not grab anything other than the dates.

41686d6564
MarkS

1 Answer

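In your regex, the lazy quantifier *? has nothing after it that forces it to expand, so it always matches zero characters and only the date is captured. Enable DOTALL, disable MULTILINE, and use the following regex: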

(\d{4}-\d{2}-\d{2})(.*?)(?=\d{4}-\d{2}-\d{2}|$)

Demo.

Details:

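  • (\d{4}-\d{2}-\d{2}) Group 1: captures the date value. [1]

  • (.*?) Group 2: lazily matches any characters, including newlines (this is why DOTALL is needed).

  • (?=\d{4}-\d{2}-\d{2}|$) A positive lookahead to make sure the previous group is followed by either another date value or the end of the string. MULTILINE must be off because, with it on, $ also matches at the end of every line, so the lazy group would stop at the first line break.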
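
As a rough sketch of how this regex could be used from Python to build the list of tuples the question asks for: the helper name split_by_date and the sample string are made up for illustration, and the call to strip() is an assumption about the desired output (without it, each body keeps its leading and trailing newlines).

```python
import re

# Same pattern as above, with the date sub-pattern factored out.
DATE = r"\d{4}-\d{2}-\d{2}"
# DOTALL lets "." match newlines; MULTILINE is deliberately not set.
PATTERN = re.compile(rf"({DATE})(.*?)(?={DATE}|$)", re.DOTALL)


def split_by_date(text):
    """Return a list of (date, body) tuples; hypothetical helper name."""
    # findall returns one (group 1, group 2) tuple per match.
    # strip() drops the blank lines around each body (an assumption
    # about the desired output, not part of the regex itself).
    return [(date, body.strip()) for date, body in PATTERN.findall(text)]


sample = "2018-01-01\n\nStuff.\n\nMore Stuff.\n\n2018-01-02\n\nDifferent stuff.\n"
entries = split_by_date(sample)
print(entries)
```

To run it on a real file, the whole file would be read into one string first (for example with open(path).read()), since the pattern has to see text spanning multiple lines.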


[1] Note that this pattern does not match only valid dates; it also matches values that are not real dates (e.g., 2018-99-99), so you might want to take that into account. You can check this question for ideas on how to validate a date.
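
One possible way to do that validation in Python, sketched here under the assumption that the dates are always in YYYY-MM-DD form, is to let datetime.strptime reject impossible values. The function name is_valid_date is hypothetical.

```python
from datetime import datetime


def is_valid_date(value):
    """Return True if value is a real calendar date in YYYY-MM-DD form."""
    try:
        # strptime raises ValueError for impossible dates such as month 99.
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


candidates = ["2018-01-01", "2018-99-99", "2018-02-30"]
valid = [c for c in candidates if is_valid_date(c)]
print(valid)
```

A check like this could be applied to the first element of each tuple after matching, rather than trying to encode calendar rules in the regex itself.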

41686d6564