
I want to print sections of a file from a matching line to an empty line, so I am looking for a way to express

sed -n '/^Word .*/,/^$/p'

in Python.

For instance, if I had a file containing these sections:

Fruits
Apples:  10
Oranges: 20
Bananas:  5

Pastry
Cupcakes: 5
Buns:    10
Waffles: 20

How do I get the Fruits section?

In Perl I could do:

if ( /^Fruits/ .. /^$/ ) {
    print;
}

But I don't know how to do this in Python.

Borodin
  • use a state machine... once you find the line you are looking for, use a boolean variable to indicate start of printing... then when you hit an empty line, set it to false.. see https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python for reading file line by line – Sundeep Oct 30 '17 at 16:06
  • thanks, could you elaborate..? im very new to python :) – Dan-Simon Myrland Oct 30 '17 at 16:14
  • see https://stackoverflow.com/questions/11732383/python-read-specific-lines-of-text-between-two-strings and https://stackoverflow.com/questions/31786823/print-lines-between-two-patterns-in-python – Sundeep Oct 30 '17 at 16:27
  • Ah, thanks! This helps :) – Dan-Simon Myrland Oct 30 '17 at 16:29
  • Does this have to be Python? It's easier in most other script languages. – Borodin Oct 30 '17 at 16:32
  • Or, if your files aren't huge, you can read it whole and use either `split` or `search` with regex. See [this post](https://stackoverflow.com/q/18568105/4653379) for example. There's no "flip-flop" operator in Python I think, but there are libraries implementing it -- or you can write a class/function to implement it, by using a flag while reading line by line. – zdim Oct 30 '17 at 16:32
  • Thanks zdim, I'll check it out. Yes, Borodin, I too find this much easier in other scripting languages. I wanted to try this out in Python simply as a learning experience :) – Dan-Simon Myrland Oct 30 '17 at 16:36
  • If it's for learning I would definitely suggest to write a function, or better yet a small class, that implements a sensible range (flip-flop) operator – zdim Oct 31 '17 at 17:24
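
The flag-based approach suggested in the comments above could be sketched roughly like this. It is only a minimal illustration, not code from the thread: the function name `select_range` and its parameters are made up for this example, and it mimics Perl's `..` (the end pattern is also tested on the line that matched the start pattern).

```python
import re

def select_range(lines, start_pattern, end_pattern=r"^$"):
    """Yield lines from one matching start_pattern through one matching end_pattern.

    Hypothetical helper: a boolean flag plays the role of Perl's flip-flop.
    """
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    inside = False
    for line in lines:
        text = line.rstrip("\n")
        if not inside and start.search(text):
            inside = True           # start of a section
        if inside:
            yield text
            if end.search(text):
                inside = False      # the empty line closes the section

sample = ["Fruits\n", "Apples:  10\n", "Oranges: 20\n", "\n", "Pastry\n", "Buns:    10\n"]
for text in select_range(sample, r"^Fruits"):
    print(text)
```

Because the function takes any iterable of lines, an open file object can be passed instead of `sample`, so the file is read line by line rather than loaded whole.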

2 Answers


I think you are looking for regex.

The following example extracts your sections using a regular expression:

import re

txt = """Fruits
Apples:  10
Oranges: 20
Bananas:  5

Pastry
Cupcakes: 5
Buns:    10
Waffles: 20"""

print(re.findall(r"Fruits.*?(?:\n\n|$)", txt, re.DOTALL))
print()
print(re.findall(r"Pastry.*?(?:\n\n|$)", txt, re.DOTALL))

Here, findall returns a list of all occurrences of "Word.*?(?:\n\n|$)" in the string txt. The regex matches any sequence of characters that starts with Word, followed by any character (.) occurring zero or more times in non-greedy mode (*?). Finally, (?:\n\n|$) ensures the sequence ends with either a double newline \n\n or the end of the string $. The option re.DOTALL makes . match newlines as well.
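
If the heading should only match at the start of a line (as `^Word` does in the sed version), one possible variation is to add re.MULTILINE. This is a sketch under assumptions: `find_sections` is a made-up helper name, and `\Z` replaces `$` because with re.MULTILINE `$` would match at the end of every line and cut the non-greedy match short.

```python
import re

def find_sections(text, word):
    """Return every blank-line-terminated section whose first line starts with word.

    Hypothetical helper, not part of the re module.
    """
    pattern = re.compile(
        r"^" + re.escape(word) + r".*?(?:\n\n|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    # Drop the trailing newlines that the match carries along.
    return [match.rstrip("\n") for match in pattern.findall(text)]

txt = "Fruits\nApples:  10\nOranges: 20\n\nPastry\nCupcakes: 5"
for section in find_sections(txt, "Fruits"):
    print(section)
```

re.escape is used so that a heading containing regex metacharacters is matched literally.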

snake_charmer
  • Thanks @zdim for your constructive comment. Unfortunately `"\n\n"` was not enough to identify the end of every paragraph, because the last one is not followed by an empty line. One should look for a sequence ending with either a double newline or the end of the string. The `(?:\n\n|$)` expression worked for me. – snake_charmer Oct 30 '17 at 18:38
  • The `\n\n` is generally used to identify a paragraph -- except that it may not work for the very last one, as you noticed :) Adding `$` is a good fix (and one can also add optional spaces there, `\s*$`, to effectively trim the possible trailing whitespace). – zdim Oct 31 '17 at 17:19

You could split the string on "\n\n" and look for the strings that start with Fruits:

print(*(i for i in s.split("\n\n") if i.startswith("Fruits")))

Or if you have multiple groups:

print('\n\n'.join((i for i in s.split("\n\n") if i.startswith("Fruits"))))

Returns:

Fruits
Apples:  10
Oranges: 20
Bananas:  5

Given:

s = """Fruits
Apples:  10
Oranges: 20
Bananas:  5

Pastry
Cupcakes: 5
Buns:    10
Waffles: 20"""

Furthermore, you could extract the items into a dictionary in a couple of lines:

fruits = [block for block in s.split("\n\n") if block.startswith("Fruits")][0]
fruitdict = dict((part.strip() for part in row.split(":")) for row in fruits.split('\n')[1:])
fruitdict

Returns:

{'Apples': '10', 'Bananas': '5', 'Oranges': '20'}

or extract all categories:

categories = s.split("\n\n")

d = {}
for item in categories:
    rows = item.split('\n')
    # The first row is the category name; the rest are "name: value" pairs.
    d[rows[0]] = dict((part.strip() for part in row.split(":")) for row in rows[1:])
    # To store integer values instead:
    # d[rows[0]] = dict((row.split(":")[0], int(row.split(":")[1])) for row in rows[1:])

d

Returns:

{'Fruits': {'Apples': '10', 'Bananas': '5', 'Oranges': '20'},
 'Pastry': {'Buns': '10', 'Cupcakes': '5', 'Waffles': '20'}}
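
For reuse, the same idea might be wrapped in a function that also converts the counts to integers. This is a rough sketch: `parse_sections` is a name invented here, and it assumes every item row has the form `name: integer`.

```python
def parse_sections(text):
    """Map each section title to a dict of {item name: integer count}.

    Hypothetical helper; assumes sections are separated by one blank line.
    """
    sections = {}
    for block in text.split("\n\n"):
        rows = [row for row in block.split("\n") if row.strip()]
        if not rows:
            continue                      # skip empty blocks
        title, *items = rows
        entries = {}
        for row in items:
            name, _, count = row.partition(":")
            entries[name.strip()] = int(count)   # int() tolerates surrounding spaces
        sections[title.strip()] = entries
    return sections

s = "Fruits\nApples:  10\nOranges: 20\n\nPastry\nCupcakes: 5"
print(parse_sections(s)["Fruits"])
```

str.partition is used instead of split so that a row without a colon fails loudly in int() rather than silently producing a malformed pair.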
Anton vBR