Suppose I have a plain-text file that contains the following ordered list, spread across multiple lines.

This is a text\n
that contains an ordered/numbered list\n
appearing on multiple lines in a plain-text file.\n
\n
Item 1. This is a list where each item can span over\n
 multiple lines\n
Item 2. that I want to extract each separate item from but ONLY in series (order)\n
Item 3. non-blank text\n
Item 4. non-blank text\n
Item 5. non-blank text\n
Item 6. non-blank text\n
Item 7. non-blank text\n
Item 8. non-blank text\n
Item 9. non-blank text\n
Item 10. non-blank text\n
Item 11. The items are in an ordered list, but digits may repeat (11, 22)\n
 or they may be preceded or followed by another digit (20, 35, 300) with\n
...
Item 999. Up to 999 items\n
 in each ordered list\n
\n
But, (most annoyingly), any Item n (with up to 3 digits) or Items may be repeated\n
 or back-referenced later in text but not\n
 again as an ordered list (or in series) as the first\n
 instance of each item in the list above.

Desired capture/output from regex:

Return the text of each item (potentially across multiple lines) as it appears in the ordered list.

Item 1. [Text]\n

Item 2. [Text]\n

[Text may span multi-line]

Item N (up to 999). [Text]\n

My current best regex construction is as follows:

(Item\s[\d]+\. )(.*?)(?=(Item\s[\d]+\.)|($))

The regex above does not match across newlines, because . excludes newline characters by default, so items in the ordered list that span multiple lines are not captured.
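A minimal reproduction of the problem, as a sketch only: the sample string is a shortened, hypothetical stand-in for the text above, and the names sample, pattern and bodies are illustrative. Without any flags, the lazy .*? cannot cross a newline, so only an item that runs straight to the end of the string is matched.

```python
import re

# Hypothetical sample, shortened from the example text in the question.
sample = (
    "Item 1. This is a list where each item can span over\n"
    " multiple lines\n"
    "Item 2. non-blank text\n"
)

# The pattern from the question, written as a raw string, with no flags.
pattern = re.compile(r'(Item\s[\d]+\. )(.*?)(?=(Item\s[\d]+\.)|($))')

# Group 2 holds the text of each matched item.
bodies = [m.group(2) for m in pattern.finditer(sample)]
print(bodies)  # ['non-blank text'] -- Item 1 is missing entirely
```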

My question: Is it possible using regex in Python to extract just the items in the ordered list? And if not possible using regex, how would I most efficiently go about 'locating' the ordered list in a text such as this using Python and extract it?

DV Hughes

1 Answer

Use the re.DOTALL flag in Python, which makes . match newline characters as well, so each item can span multiple lines.

re.compile(r'(Item\s[\d]+\. )(.*?)(?=(Item\s[\d]+\.)|($))', re.DOTALL)
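Note that DOTALL on its own will also match later back-references such as a stray "Item 2." in the prose. One possible way to keep only the first in-series run is sketched below. It uses a variant of the pattern above that captures the item number, and it rests on two assumptions that are not stated in the question: the list starts at Item 1, and a blank line ends the list. The names ITEM_RE and extract_ordered_items, and the sample text, are hypothetical.

```python
import re

# Variant of the pattern above: group 1 is the item number (up to 3 digits),
# group 2 is the item text. DOTALL lets . match newlines.
ITEM_RE = re.compile(r'Item\s(\d{1,3})\. (.*?)(?=Item\s\d{1,3}\.|$)', re.DOTALL)

def extract_ordered_items(text):
    """Return [(number, body), ...] for the first run of items numbered 1, 2, 3, ..."""
    items = []
    expected = 1
    for m in ITEM_RE.finditer(text):
        number = int(m.group(1))
        if number != expected:
            if items:
                break      # series interrupted: treat the rest as back-references
            continue       # mention before the list starts; keep looking for Item 1
        # Assumption: a blank line ends the list, so cut the body there.
        body = m.group(2).split("\n\n", 1)[0].strip()
        items.append((number, body))
        expected += 1
    return items

# Hypothetical sample with a multi-line item and a later back-reference.
sample = (
    "Intro text\n"
    "\n"
    "Item 1. first item spanning\n"
    " two lines\n"
    "Item 2. second item\n"
    "Item 3. third item\n"
    "\n"
    "Later prose that mentions Item 2. again as a back-reference.\n"
)

result = extract_ordered_items(sample)
print(result)
# [(1, 'first item spanning\n two lines'), (2, 'second item'), (3, 'third item')]
```

A limitation of this sketch: if an item's own text mentions another "Item n." out of order, the run stops early at that point, so it may need adjusting for real data.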

shockawave123