The following code extracts lines from a file based on the occurrence of substrings (I'll call them keywords here), along with the text associated with each keyword:
import re
from itertools import count

def find_content_blocks_by_keywords(lines, keywords):
    # Indexes of the lines that match any of the keywords
    keyword_indexes = sorted([i for i, line in zip(count(), lines) for
                              keyword in keywords if re.search(keyword, line)])
    # Slice from each keyword line up to the next one (None = end of list)
    return [lines[i:j] for i, j in zip([0]+keyword_indexes, keyword_indexes+[None])]
These are my keywords and the lines of my sample text file:
keywords = ['Total item value', 'Total weight', 'Total volume']
lines = ['Total item value RSX 05,018.88\n',
'Total weight 90,969 EUR\n',
'Total volume -97.93 X3 Sca.\n',
'197.939 X3 Sca.']
The substrings are extracted along with their values like this:
result = find_content_blocks_by_keywords(lines, keywords)
Sample Result:
[[],
['Total item value RSX 05,018.88\n'],
['Total weight 90,969 EUR\n'],
['Total volume -97.93 X3 Sca.\n', '197.939 X3 Sca.']]
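The leading empty list in the result comes from how the slice boundaries are paired. A minimal sketch of that pairing, assuming the keyword indexes found for the sample above are 0, 1 and 2:

```python
# Keyword line indexes for the sample input (assumed from the sample above).
keyword_indexes = [0, 1, 2]

# Pair each start index with the next one; None means "to the end of the list".
slice_bounds = list(zip([0] + keyword_indexes, keyword_indexes + [None]))
print(slice_bounds)  # [(0, 0), (0, 1), (1, 2), (2, None)]
```

The first pair, (0, 0), yields lines[0:0], which is empty because the first keyword sits on the very first line.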
Can we achieve this using re.findall or any other re method directly?
The content of my files is not fixed, so I can't write a specific regular expression for each value. The logic is: find a keyword and take all the content that follows it until the next keyword occurs.
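To make the idea concrete, here is a minimal sketch of what a single-pattern version might look like. It assumes the lines are joined into one string and that the keywords are literal text (hence re.escape), whereas the function above passes them to re.search as patterns. It is only an illustration of the logic described above, not a verified equivalent: text before the first keyword is dropped, and a block starts at the keyword itself rather than at the start of its line.

```python
import re

keywords = ['Total item value', 'Total weight', 'Total volume']
lines = ['Total item value RSX 05,018.88\n',
         'Total weight 90,969 EUR\n',
         'Total volume -97.93 X3 Sca.\n',
         '197.939 X3 Sca.']

# Work on one string so a block can span several lines.
text = ''.join(lines)

# Assumption: keywords are literal strings, so escape them before joining.
alternation = '|'.join(re.escape(k) for k in keywords)

# A keyword, then as little as possible up to the next keyword or end of text.
pattern = re.compile(rf'(?:{alternation}).*?(?=(?:{alternation})|\Z)', re.DOTALL)

blocks = pattern.findall(text)

# Split each block back into lines to mirror the list-of-lists shape.
blocks_as_lines = [block.splitlines(keepends=True) for block in blocks]
print(blocks_as_lines)
```

For the sample input this gives the same three non-empty blocks as the function, without the leading empty list.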