
I am trying to parse an unstructured document using a set of predefined placeholders. The document contains text similar to the following:

CARD DIVISION
                                                Due Date                        Minimum Amount Due ($.)
John Doe,
Bank Division,
Email : contact@bankkofamerica.com
                                                  11-MAY-2020                                      1,000
web : https://www.bankkofamerica.com

Notice that the due date (11-MAY-2020) is directly below its placeholder, and so is the minimum amount due (1,000).

How can we extract the Due Date and Minimum amount due using regular expressions?

halfer
Kiran
  • This single example is not a lot to go on. Are you saying that for multiple similar documents, the date and 'due' value are always positioned horizontally in line with their headings? How do you decide you've reached the correct line? Is it always the line after 'Email'? Or the first line that starts with a space? – Grismar Jun 04 '20 at 06:29
  • Yes, it is directly below, because data is extracted from a "Table". I just updated, to make it more clear. – Kiran Jun 04 '20 at 06:33
  • I fail to understand why this question was closed. Can you please point to the exact reference where this query has been answered? Please don't turn StackOverflow into Wikipedia. Most programmers come to this esteemed site with specific questions, expecting a specific answer. There are a lot of kind souls who help answer the questions. Pointing to a Wikipedia-style article is a gross misuse of the privilege bestowed upon you by the community. – Kiran Jun 04 '20 at 08:20
  • I think the message here is that you didn't try to write a regex yourself and the job at hand is fairly straightforward. Try to write one with the available information (that linked question has a ton, but there's lots of other great resources like regex101.com) and if you cannot get it to work, come back with a question about your code instead of an open question asking others to come up with code. By the way, I didn't vote to close, but I sympathise. – Grismar Jun 04 '20 at 09:24

1 Answer

Here is a regex-based option using re.findall:

import re

inp = """                                                  Due Date                        Minimum Amount Due ($.)
John Doe,
Bank Division,
Email : contact@bankkofamerica.com
                                              11-MAY-2020                                      1,000
web : https://www.bankkofamerica.com"""

matches = re.findall(r'Due Date\b(?:(?!\bDue Date\b).)*(\d{2}-[A-Z]{3}-\d{4})\s+(\d{1,3}(?:,\d{3})*)', inp, flags=re.DOTALL)
print(matches)

This prints:

[('11-MAY-2020', '1,000')]

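The regex first matches the marker text Due Date. It then uses a tempered dot, (?:(?!\bDue Date\b).)*, to consume text without crossing another Due Date heading, until it reaches the actual due date and the amount, which are captured in the two groups. The re.DOTALL flag lets the dot match newlines, so the match can span several lines.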
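For readability, the same idea can be written with re.VERBOSE and named groups, and the captured strings can then be converted to a date and an integer. This is only a sketch under a few assumptions: the sample text from the question is representative, the month abbreviations are English (strptime's %b depends on the locale), and the amount has no decimal part. The helper name extract_due and the group names due and amount are illustrative, and the skip is made lazy (*?) here so the first date after the heading wins.

```python
import re
from datetime import datetime

# Sample text copied from the question; assumed to be representative.
text = """CARD DIVISION
                                                Due Date                        Minimum Amount Due ($.)
John Doe,
Bank Division,
Email : contact@bankkofamerica.com
                                                  11-MAY-2020                                      1,000
web : https://www.bankkofamerica.com"""

# In verbose mode unescaped whitespace is ignored, so the space in
# "Due Date" has to be escaped.
PATTERN = re.compile(
    r"""
    Due\ Date\b                     # heading that marks the start of the table
    (?:(?!\bDue\ Date\b).)*?        # skip ahead lazily, never crossing another heading
    (?P<due>\d{2}-[A-Z]{3}-\d{4})   # date such as 11-MAY-2020
    \s+
    (?P<amount>\d{1,3}(?:,\d{3})*)  # whole amount with thousands separators
    """,
    re.DOTALL | re.VERBOSE,
)


def extract_due(text):
    """Return (due_date, amount), or None if the layout does not match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # %b parses abbreviated month names; assumes an English locale.
    due = datetime.strptime(m.group("due"), "%d-%b-%Y").date()
    amount = int(m.group("amount").replace(",", ""))
    return due, amount


print(extract_due(text))
```

Returning None when nothing matches lets the caller decide how to handle documents with a different layout instead of failing on an empty result.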

Tim Biegeleisen