
Say I have a list of tens of thousands of entries, and I want to write them to files. If the item in the list meets some criteria, I'd like to close the current file and start a new one.

I'm having a couple of issues, and I think they stem from the fact that I want to name the files based on the first entry in each file. Also, the signal to start a new file is an entry whose first field differs from the previous entry's. So, for example, imagine I have the list:

l = [('name1', 10), ('name1', 30), ('name2', 5), ('name2', 7), ('name2', 3), ('name3', 10)]

I'd want to end up with 3 files: name1.txt should contain 10 and 30, name2.txt should have 5, 7 and 3, and name3.txt should have 10. The list is already sorted by the first element, so all I need to do is check whether the first element is the same as the previous one and, if not, start a new file.

At first I tried:

name = None
for entry in l:
    if entry[0] != name:
        out_file.close()
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")
        out_file.write("{}\n".format(entry[1]))
    else:
        out_file.write("{}\n".format(entry[1]))

out_file.close()

There are a couple of problems with this as far as I can tell. First, the first time through the loop, there's no out_file to close. Second, I can't close the last out_file created, since it's defined inside the loop. The following solves the first problem, but seems clunky:

name = None
for entry in l:
    if name:
        if entry[0] != name:
            out_file.close()
            name = entry[0]
            out_file = open("{}.txt".format(name), "w")
            out_file.write("{}\n".format(entry[1]))
        else:
            out_file.write("{}\n".format(entry[1]))
    else:
        name = entry[0]
        out_file = open("{}.txt".format(name), "w")
        out_file.write("{}\n".format(entry[1]))

out_file.close()

Is there a better way to do this?

And also, this doesn't seem like it should solve the problem of closing the last file, though this code runs fine - am I misunderstanding the scope of out_file? I thought it would be restricted to inside the for loop.
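
For what it's worth, a quick test I ran (so I may well be misreading this) suggests a name assigned inside a for loop is still visible after the loop ends:

for i in range(3):
    x = i * 10

print(x)  # prints 20, so x is still bound after the loop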

EDIT: I should probably have mentioned, my data is far more complex than indicated here... it's not actually in a list, it's a SeqRecord from BioPython

EDIT 2: OK, I thought I was simplifying in order to avoid distraction. Apparently it had the opposite effect - mea culpa. The following is the equivalent of the second code block above:

from re import sub
from Bio import SeqIO

def gbk_to_faa(some_genbank):
    source = None
    for record in SeqIO.parse(some_genbank, 'gb'):
        if source:
            if record.annotations['source'] != source:
                out_file.close()
                source = sub(r'\W+', "_", sub(r'\W$', "", record.annotations['source']))
                out_file = open("{}.faa".format(source), "a+")
                write_all_record(out_file, record)
            else:
                write_all_record(out_file, record)
        else:
            source = sub(r'\W+', "_", sub(r'\W$', "", record.annotations['source']))
            out_file = open("{}.faa".format(source), "a+")
            write_all_record(out_file, record)

    out_file.close()


def write_all_record(file_handle, gbk_record):
    # Does more stuff, I don't think this is important
    # If it is, it's in this gist: https://gist.github.com/kescobo/49ab9f4b08d8a2691a40
    pass
  • As to your last question about whether `out_file` would be restricted to the for loop, check out [Scoping in Python for loops](https://stackoverflow.com/questions/3611760/scoping-in-python-for-loops) – LinkBerest Jan 13 '16 at 21:44
  • @JGreenwell - OK, so basically I completely misunderstand the scope of things in python. Good to know... – kevbonham Jan 14 '16 at 12:22

3 Answers


It is easier to use the tools Python provides:

from itertools import groupby
from operator import itemgetter

items = [
    ('name1', 10), ('name1', 30),
    ('name2', 5), ('name2', 7), ('name2', 3),
    ('name3', 10)
]

for name, rows in groupby(items, itemgetter(0)):
    with open(name + ".txt", "w") as outf:
        outf.write("\n".join(str(row[1]) for row in rows))
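
One thing worth noting: groupby only groups runs of consecutive items with equal keys, so this relies on the input already being sorted (or at least grouped) by name, as it is in the question. A quick sketch with a toy list:

from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('a', 2), ('b', 3)]  # toy data for illustration
for key, group in groupby(pairs, itemgetter(0)):
    print(key, [value for _, value in group])
# a [1, 2]
# b [3]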

Edit: to match the updated question, here is the updated solution ;-)

for name, records in groupby(SeqIO.parse(some_genbank, 'gb'),
                             lambda record: record.annotations['source']):
    with open(name + ".faa", "w+") as outf:
        for record in records:
            write_all_record(outf, record)
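
The key has to be a lambda (or an equivalent helper) here because record.annotations['source'] involves both an attribute lookup and a dict lookup, and operator.itemgetter or operator.attrgetter alone only covers one of those. A rough sketch with a hypothetical stand-in class (FakeRecord is not part of BioPython, it just mimics the shape of a record):

from itertools import groupby

class FakeRecord:
    # hypothetical stand-in for a SeqRecord, just to exercise the key function
    def __init__(self, source):
        self.annotations = {'source': source}

records = [FakeRecord('org_A'), FakeRecord('org_A'), FakeRecord('org_B')]
for source, group in groupby(records, lambda record: record.annotations['source']):
    print(source, len(list(group)))
# org_A 2
# org_B 1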
  • Sorry, I should have mentioned, my data is a bit more complicated than I indicated (see edit). Still, this might useful - will have to see if I can figure out if it's compatible. – kevbonham Jan 13 '16 at 21:49
  • This solution really doesn't depend upon the simplicity of your data. It will work for any sorted sequence. Just make sure that the key function to groupby is compatible with the key function that sorted the data. – Robᵩ Jan 13 '16 at 21:51
  • @Robᵩ I just posted the actual code. The data is sorted in the file that I'm parsing. So I need to read in the file, parse it, and then split it according to a certain criteria: `record.annotations['source']`. The `record` is an object returned by the iterator, and `record.annotations` is a dictionary (I think). Can I use the dot operator with itemgetter? – kevbonham Jan 14 '16 at 20:56
  • Looks like maybe I can use `operator.attrgetter`? [python Docs](https://docs.python.org/2/library/operator.html#operator.attrgetter) – kevbonham Jan 14 '16 at 21:08
  • @kevbonham: I think you will have to use `lambda record: record.annotations['source']`. – Hugh Bothwell Jan 14 '16 at 21:11
  • @HughBothwell Rad - that worked! I put the working code at the bottom of this [gist](https://gist.github.com/kescobo/49ab9f4b08d8a2691a40) - do you want to add it to your answer and then I'll accept it? – kevbonham Jan 14 '16 at 21:18

If you don't mind using pandas, you could deal with this as follows:

import pandas as pd
df = pd.DataFrame(l)
df.columns = ['name', 'value']
df.groupby('name').apply(lambda x: x.to_csv('{}.txt'.format(x['name'].iloc[0]), index=False, header=False))

to get three text files named name1.txt etc. that look like:

name1,10
name1,30
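
If the files should contain only the values (10 and 30 rather than name1,10, as described in the question), one possible variation, assuming the same two-column frame, would be something like:

import pandas as pd

l = [('name1', 10), ('name1', 30), ('name2', 5), ('name2', 7), ('name2', 3), ('name3', 10)]
df = pd.DataFrame(l, columns=['name', 'value'])
for name, group in df.groupby('name'):
    group[['value']].to_csv('{}.txt'.format(name), index=False, header=False)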
  • The structure of my data is significantly more complex than I'm indicating here, this example was just for illustrative purposes. I don't think loading into a `DataFrame` would work very well. – kevbonham Jan 13 '16 at 21:43
  • Well how about an illustrative example that better illustrates the case at hand? Also the `pandas` tag suggests relevance where there seems to be none? – Stefan Jan 13 '16 at 21:46
  • Err... my bad. That was left over from a question I started asking last week and then figured out on my own. Evidently it was cached or something. – kevbonham Jan 13 '16 at 22:31
  • So do your data look like the result of `print record` from the page you linked, but as a tuple? How about posting a sample? – Stefan Jan 14 '16 at 09:16
  • Just posted the actual code that works, does that reveal the structure of the data sufficiently or should I post something else? It's an iterator that contains several things, including a dictionary that I need to grab a particular item from. – kevbonham Jan 14 '16 at 20:51

Without messing with your code, why don't you just check if the out_file variable exists before closing?

out_file = None
...  # Some code
if out_file:
    out_file.close()

You could also use a try/except for this.
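
For example, something along these lines, catching the case where the name was never bound at all:

try:
    out_file.close()
except NameError:
    pass  # no file was ever opened, so there is nothing to close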

Or maybe even making a class (although it's overkill):

class f_temp():
    # dummy stand-in so the first iteration has something to compare
    # against and to "close" without raising an error
    name = None
    def close(self):
        pass

out_file = f_temp()

for entry in l:
    if entry[0] != out_file.name:

...

Now reading a bit more, why don't you sort your data by filename, and just open one file at a time?


You could also use a dictionary for this:

file_dict = dict()
for filename, value in l:
    if filename not in file_dict:
        file_dict[filename] = open("{}.txt".format(filename), "w")
    file_dict[filename].write("{}\n".format(value))

for out_file in file_dict.values():
    out_file.close()
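
One caveat with the dictionary approach is that every file stays open until the end of the run. If that is a concern, contextlib.ExitStack can take care of the closing automatically; a rough sketch (with toy data) might look like:

from contextlib import ExitStack

l = [('name1', 10), ('name1', 30), ('name2', 5)]  # toy data for illustration

with ExitStack() as stack:
    handles = {}
    for filename, value in l:
        if filename not in handles:
            handles[filename] = stack.enter_context(open("{}.txt".format(filename), "w"))
        handles[filename].write("{}\n".format(value))
# every file is closed when the with block exits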