Correct me if I'm wrong, but a TSV file is basically a CSV file that uses a tab character instead of a comma as the delimiter. To translate it efficiently in Python, you iterate through the lines of the source file, replace the tabs with commas, and write each new line to the new file. You don't need any module for this; writing the solution in plain Python is actually quite simple:
def tsv_to_csv(filename):
    # Derive the output name: swap a trailing '.tsv' for '.csv',
    # or just append '.csv' if the extension isn't there.
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'
    with open(filename) as original, open(new_filename, 'w') as new:
        for line in original:
            new.write(line.replace('\t', ','))
    return new_filename
Iterating through the lines like this loads only one line into memory at a time instead of reading the whole file at once. It might still take a while to churn through 12GB of data, though.
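One caveat on the "correct me if I'm wrong" above: a plain tab-to-comma replace produces a broken CSV if any field itself contains a comma, a quote, or a newline. If that can happen in your data, a variant using the standard-library csv module (just a sketch; the function name is mine) would quote those fields correctly, at some cost in speed:

import csv

def tsv_to_csv_quoted(filename, new_filename):
    # Let the csv module handle quoting so fields that contain
    # commas or quotes survive the conversion intact.
    with open(filename, newline='') as original, open(new_filename, 'w', newline='') as new:
        reader = csv.reader(original, delimiter='\t')
        writer = csv.writer(new)
        writer.writerows(reader)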
UPDATE:
In fact, now that I think about it, it may be significantly faster to use binary I/O on a file this large and translate the tabs to commas in large chunks rather than line by line. This code follows that strategy:
from io import FileIO

# This chunk size loads 1MB at a time for conversion.
CHUNK_SIZE = 1 << 20

def tsv_to_csv_BIG(filename):
    # Same output-name logic as above.
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'
    original = FileIO(filename, 'r')
    new = FileIO(new_filename, 'w')
    # A 256-byte translation table that maps tab to comma.
    table = bytes.maketrans(b'\t', b',')
    while True:
        chunk = original.read(CHUNK_SIZE)
        if len(chunk) == 0:
            break
        new.write(chunk.translate(table))
    original.close()
    new.close()
    return new_filename
On my laptop using a 1GB TSV file, the first function takes 4 seconds to translate to CSV while the second function takes 1 second. Tuning the CHUNK_SIZE parameter might speed it up more if your storage can keep up, but 1MB seems to be the sweet spot for me.
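If you want to reproduce the comparison or experiment with CHUNK_SIZE yourself, a rough timing sketch (the file path below is just a placeholder) could look like this:

import time

def time_it(func, filename):
    # Crude wall-clock timing of one conversion run.
    start = time.perf_counter()
    func(filename)
    return time.perf_counter() - start

# print(time_it(tsv_to_csv, 'big_table.tsv'))
# print(time_it(tsv_to_csv_BIG, 'big_table.tsv'))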
Using tr as mentioned in another answer took 3 seconds for me, so the chunked Python approach seems fastest.
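For reference, that comparison amounts to translating tabs to commas with the coreutils tr command; a rough sketch of driving it from Python (the function name and paths are mine, not from the other answer) might be:

import subprocess

def tsv_to_csv_tr(filename, new_filename):
    # Pipe the file through coreutils tr, translating tab to comma.
    with open(filename, 'rb') as original, open(new_filename, 'wb') as new:
        subprocess.run(['tr', '\t', ','], stdin=original, stdout=new, check=True)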