
The following code extracts lines from a file based on the occurrence of given substrings (I'll call them keywords here), along with the text associated with each of them:

import re

def find_content_blocks_by_keywords(lines, keywords):
    # Indexes of the lines that contain any of the keywords
    keyword_indexes = sorted(i for i, line in enumerate(lines)
                             for keyword in keywords if re.search(keyword, line))
    # Slice the lines between consecutive keyword positions
    return [lines[i:j] for i, j in zip([0] + keyword_indexes, keyword_indexes + [None])]

This is my sample input (the lines of a text file):

keywords = ['Total item value', 'Total weight', 'Total volume']
lines = ['Total item value RSX 05,018.88\n',
  'Total weight 90,969 EUR\n',
  'Total volume -97.93 X3 Sca.\n',
  '197.939 X3 Sca.']

Extracting the keywords along with their values:

result = find_content_blocks_by_keywords(lines, keywords)

Sample Result:

[[],
 ['Total item value RSX 05,018.88\n'],
 ['Total weight 90,969 EUR\n'],
 ['Total volume -97.93 X3 Sca.\n', '197.939 X3 Sca.']]

Can we achieve this using re.findall or any other re method directly?

Since the content of my files is not fixed, I cannot write one specific regular expression for it. The logic is: find a keyword and take all the content that follows it until the next keyword occurs.

Laxmikant
  • Is it always the case that there is a newline before a next keyword? – Sven Krüger May 28 '18 at 05:57
  • @Sven Krüger- Yes – Laxmikant May 28 '18 at 06:09
  • Do any of the answers to [How to parse complex text files using Python?](https://stackoverflow.com/questions/47982949/how-to-parse-complex-text-files-using-python) help? – Mike Robins May 28 '18 at 08:16
  • @MikeRobins - Thanks, let me take a look at it. – Laxmikant May 28 '18 at 08:34
  • Your code is not working the way you described. What is `content`? If I replace it with `lines`, [it produces a list of lists](http://rextester.com/LYBO46416). – Wiktor Stribiżew May 28 '18 at 12:19
  • Hi @WiktorStribiżew - Sorry, there were some typos (as I had edited manually). Please check now and suggest a solution if any. Thanks – Laxmikant May 28 '18 at 12:58
  • Hm, check out [`print(re.findall(r'(?m)^(?:{0}).*(?:[\r\n]+(?!(?:{0})).*)*'.format("|".join([re.escape(x) for x in keywords])), "\n".join(lines)))`](http://rextester.com/MBFXL38321) – Wiktor Stribiżew May 28 '18 at 13:10
  • @WiktorStribiżew - Just checked, and yes, it's perfect. Thank you very much. Please suggest some references for building expertise in regex. I'm good at the basics but did not know about `?:`, `[\r\n]`, etc. Please post it as an answer and I will accept it. – Laxmikant May 29 '18 at 07:48
  • I do not know your level of regex knowledge :) so I can only suggest doing all the lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), the [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. – Wiktor Stribiżew May 29 '18 at 07:55
  • @WiktorStribiżew - Thanks for the references. :) – Laxmikant May 29 '18 at 08:00

1 Answer

Here is the fix I suggest:

import re

keywords = ['Total item value', 'Total weight', 'Total volume']
lines = ['Total item value RSX 05,018.88\n',
  'Total weight 90,969 EUR\n',
  'Total volume -97.93 X3 Sca.\n',
  '197.939 X3 Sca.']

pat = r'(?m)^(?:{0}).*(?:[\r\n]+(?!(?:{0})).*)*'.format("|".join([re.escape(x) for x in keywords]))
print(re.findall(pat, "\n".join(lines)))

Output of the Python demo:

['Total item value RSX 05,018.88\n', 'Total weight 90,969 EUR\n', 'Total volume -97.93 X3 Sca.\n\n197.939 X3 Sca.']
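Note that the strings in `lines` already end with `\n`, so joining them with `"\n"` doubles the line breaks, which is why `\n\n` shows up in the last match. If a list of line lists (closer to the question's original output) is preferred, one possible sketch is to join with an empty string and split each match back into lines. The helper name `find_blocks` is hypothetical, and the sketch assumes every keyword starts its line and that dropping the trailing newlines is acceptable:

```python
import re

def find_blocks(lines, keywords):
    """Return one list of lines per keyword block (hypothetical helper)."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # Same pattern as in the answer above
    pat = r'(?m)^(?:{0}).*(?:[\r\n]+(?!(?:{0})).*)*'.format(alternation)
    text = "".join(lines)  # the lines already carry their own "\n"
    # splitlines() drops the line breaks inside each matched block
    return [block.splitlines() for block in re.findall(pat, text)]

keywords = ['Total item value', 'Total weight', 'Total volume']
lines = ['Total item value RSX 05,018.88\n',
         'Total weight 90,969 EUR\n',
         'Total volume -97.93 X3 Sca.\n',
         '197.939 X3 Sca.']

blocks = find_blocks(lines, keywords)
print(blocks)
```

Unlike the function in the question, this does not produce a leading empty list, and it only recognizes keywords at the start of a line because of the `^` anchor.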

Pattern description

  • (?m) - re.MULTILINE modifier making ^ match start of lines
  • ^ - start of a line
  • (?:{0}) - a non-capturing group that will contain alternatives listed with the | alternation operator (e.g. Total item value|Total weight|Total volume)
  • .* - any 0+ chars other than LF (the rest of the line)
  • (?:[\r\n]+(?!(?:{0})).*)* - 0 or more repetitions of:
    • [\r\n]+(?!(?:{0})) - 1 or more LF and/or CR characters ([\r\n]+) that are not followed by any of the keywords (the negative lookahead (?!...))
    • .* - the rest of the line
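
The breakdown above can also be written into the pattern itself with `re.VERBOSE`, which allows whitespace and comments inside a regex. This is only a sketch of an equivalent layout, not a different technique. It relies on `re.escape` escaping the spaces in the keywords so that they stay literal in verbose mode, and the comments inside the pattern must avoid curly braces because `str.format` processes the whole string. The shortened `text` value is made up for illustration:

```python
import re

keywords = ['Total item value', 'Total weight', 'Total volume']
# re.escape also escapes spaces, which keeps them literal under re.VERBOSE
alternation = "|".join(re.escape(k) for k in keywords)

pattern = re.compile(r'''
    ^ (?:{0})          # a keyword at the start of a line
    .*                 # the rest of that line
    (?:                # then zero or more continuation lines:
        [\r\n]+        #   one or more line-break characters
        (?!(?:{0}))    #   that are not followed by a keyword
        .*             #   the rest of the continuation line
    )*
'''.format(alternation), re.MULTILINE | re.VERBOSE)

# Illustrative input only
text = "Total weight 90,969 EUR\nTotal volume -97.93 X3 Sca.\n197.939 X3 Sca."
matches = pattern.findall(text)
print(matches)
```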
Wiktor Stribiżew