
I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:

def readEvalFileAsDictInverse(evalFile):
  eval = open(evalFile, "r")
  evalIDs = {}
  for row in eval:
    ids = row.split("\t")
    if ids[0] not in evalIDs.keys():
      evalIDs[ids[0]] = []
    evalIDs[ids[0]].append(ids[1])
  eval.close()


  return evalIDs 

It has been running for more than 10 hours and it is still going. I don't know how to speed this step up, or whether there is another way to parse such a file.

– bib

4 Answers


Several issues here:

  • testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but...
  • why not use a collections.defaultdict instead?
  • why not use the csv module?
  • shadowing the eval built-in (well, not really an issue, seeing how dangerous it is)

my proposal:

import csv, collections

def readEvalFileAsDictInverse(evalFile):
  with open(evalFile, "r") as handle:
     evalIDs = collections.defaultdict(list)
     cr = csv.reader(handle, delimiter='\t')
     for ids in cr:
        evalIDs[ids[0]].append(ids[1])
  return evalIDs

The magic is that evalIDs[ids[0]] creates an empty list if the key doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.

I don't think it could be faster with default libraries, but a pandas solution probably would be.
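Something like this minimal pandas sketch could be tried (untested against the real file; it assumes there is no header row and that only the first two columns matter):

import pandas as pd

def readEvalFileAsDictInverse(evalFile):
    # read only the first two columns; header=None because there is no header row
    df = pd.read_csv(evalFile, sep="\t", header=None, usecols=[0, 1], dtype=str)
    # group the value column by the ID column and collect each group into a list
    return df.groupby(0)[1].apply(list).to_dict()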

– Jean-François Fabre

Some suggestions:

Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().

dict.setdefault() will create the default value every time, and that's a time burner - defaultdict(list) does not - it is optimized:

from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
  eval = open(evalFile, "r")
  evalIDs = defaultdict(list)
  for row in eval:
    ids = row.split("\t")
    evalIDs[ids[0]].append(ids[1])
  eval.close()
  return evalIDs

If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.

Something along the lines of

awk -F $'\t' '{print > $1}' file1

will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here.) You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
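For the "grab your created files" step, a hedged sketch; it assumes the awk command was run inside an otherwise empty directory named splits/ (that directory name is not part of the original answer), so every file in it is named after a key:

import os

data = {}
for dirpath, _dirnames, filenames in os.walk("splits"):
    for key in filenames:
        with open(os.path.join(dirpath, key)) as handle:
            # each line is still the full tab-separated row, so drop the leading ID
            data[key] = [line.rstrip("\n").split("\t")[1] for line in handle]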


If your keys are not valid filenames in their own right, consider storing your different lines in different files and only keeping a dictionary of key -> filename around.

After splitting the data, load the files as lists again:

Create testfile:

with open("file.txt", "w") as w:
    w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti

    """)

Code:

# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""          
    return k # assuming k is a valid file name else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key,value, *rest = line.split("\t") # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))

        # this will open and close files _a lot_; you might want to keep file
        # handles in your dict instead (see the variant sketch after the output
        # below) - but that depends on the key/data/lines ratio in your data -
        # if you have few keys, file handles ought to be better; if you have
        # many, it does not matter
        with open(fn,"a") as f:
            f.write(value+"\n")

# create your list data from your files:
data = {}
for key,fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)

Output:

# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'], 
 '2': ['yipp', 'YYyipp', 'yyyyyyyipp'], 
 '3': ['urks', 'UUurks', 'uuuuuuurks']}
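If you have few keys, here is a hedged variant of the splitting loop above that caches the open file handles in a dict instead of reopening a file for every line (it reuses evalFile and make_filename from the code above):

handles = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")
        # open each key's file once and keep the handle around
        if key not in handles:
            handles[key] = open(make_filename(key), "a")
        handles[key].write(value + "\n")

# close everything when done
for handle in handles.values():
    handle.close()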
– Patrick Artner
  1. Change evalIDs to a collections.defaultdict(list). You can avoid the if that checks whether a key is there.
  2. Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use a multiprocessing.Pool to parallelise the loading (see the sketch below).
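A hedged sketch of point 2, assuming the file has already been split into chunk files (for example with split -l 1000000 eval.tsv chunk_; the eval.tsv and chunk_ names are placeholders):

import collections
import glob
from multiprocessing import Pool

def parse_chunk(path):
    # parse one chunk file into a plain id -> list-of-values dict
    ids = collections.defaultdict(list)
    with open(path) as handle:
        for row in handle:
            fields = row.rstrip("\n").split("\t")
            ids[fields[0]].append(fields[1])
    return dict(ids)

def merge(partials):
    # merge the per-chunk dicts into one id -> list mapping
    merged = collections.defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return merged

if __name__ == "__main__":
    chunks = sorted(glob.glob("chunk_*"))  # files produced by split(1)
    with Pool() as pool:
        evalIDs = merge(pool.map(parse_chunk, chunks))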
– Noufal Ibrahim

Maybe you can make it somewhat faster; change this:

if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])

to

evalIDs.setdefault(ids[0],[]).append(ids[1])

The first solution searches the evalIDs dictionary three times.

– kantal
  • `setdefault` is slower than a defaultdict. `timeit.timeit(lambda : d.setdefault('x',[]).append(1))` reports `0.4583683079981711` and `timeit.timeit(lambda : c['x'].append(1))` reports `0.28720847200020216` where `d` is `{}` and `c` is `collections.defaultdict(list)`. All the answers have recommended so. Why have you selected this as the correct one? The solution here is inferior to the others mentioned. – Noufal Ibrahim Nov 26 '18 at 04:45
  • I can't measure significant difference (Python 3.7.1), but the OP should measure it. – kantal Nov 26 '18 at 07:21
  • The defaultdict is roughly twice as fast as the setdefault in my measurement. (3.5.3) and I think that's reasonable given how setdefault evaluates its arguments each time you call it (a new empty list is created each time). – Noufal Ibrahim Nov 26 '18 at 09:13
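For reference, a small benchmark sketch of the comparison discussed in these comments (numbers will vary by machine and Python version):

import collections
import timeit

d = {}
c = collections.defaultdict(list)

# same micro-benchmark as in the comments: append via setdefault vs. defaultdict
print("setdefault :", timeit.timeit(lambda: d.setdefault("x", []).append(1)))
print("defaultdict:", timeit.timeit(lambda: c["x"].append(1)))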