
I have a very big TSV file (1.5 GB) that I want to parse. I'm using the following function:

def readEvalFileAsDictInverse(evalFile):
  eval = open(evalFile, "r")
  evalIDs = {}
  for row in eval:
    ids = row.split("\t")
    if ids[0] not in evalIDs.keys():
      evalIDs[ids[0]] = []
    evalIDs[ids[0]].append(ids[1])
  eval.close()


  return evalIDs 

It has been running for more than 10 hours and it is still going. I don't know how to speed this step up, or whether there is another way to parse such a file.

– bib

4 Answers


Several issues here:

  • testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but...
  • why not use a collections.defaultdict instead?
  • why not use the csv module?
  • shadowing the eval built-in (well, not really an issue, seeing how dangerous it is)

my proposal:

import csv, collections

def readEvalFileAsDictInverse(evalFile):
  with open(evalFile, "r") as handle:
     evalIDs = collections.defaultdict(list)
     cr = csv.reader(handle, delimiter='\t')
     for ids in cr:
        evalIDs[ids[0]].append(ids[1])
  return evalIDs

The magic is that evalIDs[ids[0]] creates an empty list if the key doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.

I don't think it could be faster with default libraries, but a pandas solution probably would be.
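Something like this minimal pandas sketch could be tried (untested against the real file; it assumes there is no header row and that only the first two columns matter):

import pandas as pd

def readEvalFileAsDictInverse(evalFile):
    # read only the first two columns; header=None because there is no header row
    df = pd.read_csv(evalFile, sep="\t", header=None, usecols=[0, 1], dtype=str)
    # group the value column by the ID column and collect each group into a list
    return df.groupby(0)[1].apply(list).to_dict()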

– Jean-François Fabre

Some suggestions:

Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().

dict.setdefault() will create the default value every time, and that's a time burner - defaultdict(list) does not - it is optimized:

from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
  eval = open(evalFile, "r")
  evalIDs = defaultdict(list)
  for row in eval:
    ids = row.split("\t")
    evalIDs[ids[0]].append(ids[1])
  eval.close()
  return evalIDs

If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.

Something along the lines of

awk -F $'\t' '{print > $1}' file1

will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here.) You would need to grab your created files with os.walk or similar means. Each line inside the files will still be tab-separated and contain the ID in front.
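For the "grab your created files" step, a hedged sketch; it assumes the awk command was run inside an otherwise empty directory named splits/ (that directory name is not part of the original answer), so every file in it is named after a key:

import os

data = {}
for dirpath, _dirnames, filenames in os.walk("splits"):
    for key in filenames:
        with open(os.path.join(dirpath, key)) as handle:
            # each line is still the full tab-separated row, so drop the leading ID
            data[key] = [line.rstrip("\n").split("\t")[1] for line in handle]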


If your keys are not valid filenames in their own right, consider storing your different lines in different files and only keeping a dictionary of key -> filename around.

After splitting the data, load the files as lists again:

Create testfile:

with open("file.txt", "w") as w:
    w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti

    """)

Code:

# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""          
    return k # assuming k is a valid file name else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key,value, *rest = line.split("\t") # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))

        # this will open and close files _a lot_; you might want to keep file
        # handles in your dict instead (see the variant sketch after the output
        # below) - but that depends on the key/data/lines ratio in your data -
        # if you have few keys, file handles ought to be better; if you have
        # many, it does not matter
        with open(fn,"a") as f:
            f.write(value+"\n")

# create your list data from your files:
data = {}
for key,fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)

Output:

# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'], 
 '2': ['yipp', 'YYyipp', 'yyyyyyyipp'], 
 '3': ['urks', 'UUurks', 'uuuuuuurks']}
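If you have few keys, here is a hedged variant of the splitting loop above that caches the open file handles in a dict instead of reopening a file for every line (it reuses evalFile and make_filename from the code above):

handles = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")
        # open each key's file once and keep the handle around
        if key not in handles:
            handles[key] = open(make_filename(key), "a")
        handles[key].write(value + "\n")

# close everything when done
for handle in handles.values():
    handle.close()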
– Patrick Artner
  1. Change evalIDs to a collections.defaultdict(list). You can avoid the if that checks whether a key is there.
  2. Consider splitting the file externally using split(1), or even inside Python using a read offset. Then use a multiprocessing.Pool to parallelise the loading (see the sketch below).
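A hedged sketch of point 2, assuming the file has already been split into chunk files (for example with split -l 1000000 eval.tsv chunk_; the eval.tsv and chunk_ names are placeholders):

import collections
import glob
from multiprocessing import Pool

def parse_chunk(path):
    # parse one chunk file into a plain id -> list-of-values dict
    ids = collections.defaultdict(list)
    with open(path) as handle:
        for row in handle:
            fields = row.rstrip("\n").split("\t")
            ids[fields[0]].append(fields[1])
    return dict(ids)

def merge(partials):
    # merge the per-chunk dicts into one id -> list mapping
    merged = collections.defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return merged

if __name__ == "__main__":
    chunks = sorted(glob.glob("chunk_*"))  # files produced by split(1)
    with Pool() as pool:
        evalIDs = merge(pool.map(parse_chunk, chunks))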
– Noufal Ibrahim

Maybe you can make it somewhat faster; change this:

if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])

to

evalIDs.setdefault(ids[0],[]).append(ids[1])

The first solution searches the evalIDs dictionary three times.

– kantal
  • `setdefault` is slower than a defaultdict. `timeit.timeit(lambda : d.setdefault('x',[]).append(1))` reports `0.4583683079981711` and `timeit.timeit(lambda : c['x'].append(1))` reports `0.28720847200020216` where `d` is `{}` and `c` is `collections.defaultdict(list)`. All the answers have recommended so. Why have you selected this as the correct one? The solution here is inferior to the others mentioned. – Noufal Ibrahim Nov 26 '18 at 04:45
  • I can't measure significant difference (Python 3.7.1), but the OP should measure it. – kantal Nov 26 '18 at 07:21
  • The defaultdict is roughly twice as fast as the setdefault in my measurement. (3.5.3) and I think that's reasonable given how setdefault evaluates its arguments each time you call it (a new empty list is created each time). – Noufal Ibrahim Nov 26 '18 at 09:13
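For reference, a small benchmark sketch of the comparison discussed in these comments (numbers will vary by machine and Python version):

import collections
import timeit

d = {}
c = collections.defaultdict(list)

# same micro-benchmark as in the comments: append via setdefault vs. defaultdict
print("setdefault :", timeit.timeit(lambda: d.setdefault("x", []).append(1)))
print("defaultdict:", timeit.timeit(lambda: c["x"].append(1)))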