
I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:

import pandas as pd

table = pd.read_table(path_of_tsv_file, sep='\t')
table.to_csv(path_and_name_of_csv_file, index=False)

However, this code does not work for my large file, and the kernel resets in the middle.

Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?

I am using Windows 10.

Monirrad
  • Are you working on a Unix-based system? – Allan Mar 11 '19 at 02:10
  • If yes, you can just use the following command: `tr '\t' ',' <input.tsv >output.csv` (the `,` can be changed to another field separator) – Allan Mar 11 '19 at 02:16
  • I work on Windows 10. Your second comment is not clear to me. – Monirrad Mar 11 '19 at 02:21
  • If we have answered your question, could you please accept an answer/upvote as detailed in: https://stackoverflow.com/help/someone-answers – Allan Mar 27 '19 at 08:58

4 Answers


Instead of loading all the lines into memory at once, you can read the file line by line and process them one after another:

With Python 3.x:

fs = ","                           # output field separator
table = str.maketrans('\t', fs)    # translation table: tab -> comma
fName = 'hrdata.tsv'

try:
    with open(fName, 'r') as f:
        # Read one line at a time so only a single line is ever held in memory
        for line in f:
            print(line.translate(table), end="")
except IOError:
    print("Could not read file: " + fName)

Input (hrdata.tsv):

Name    Hire Date       Salary  Sick Days remaining
Graham Chapman  03/15/14        50000.00        10
John Cleese     06/01/15        65000.00        8
Eric Idle       05/12/14        45000.00        10
Terry Jones     11/01/13        70000.00        3
Terry Gilliam   08/12/14        48000.00        7
Michael Palin   05/23/13        66000.00        8

Output:

Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8

Command:

python tsv_csv_convertor.py > new_csv_file.csv

Note:

If you use a Unix env, just run the command:

tr '\t' ',' <input.tsv >output.csv
Allan

Correct me if I'm wrong, but a TSV file is basically a CSV file that uses a tab character instead of a comma (so a plain tab-to-comma substitution is safe as long as no field itself contains a comma, tab, or quote that would need escaping). To translate this efficiently in Python, you need to iterate through the lines of your source file, replace the tabs with commas, and write each new line to the new file. You don't need any module to do this; writing the solution in plain Python is actually quite simple:

def tsv_to_csv(filename):
    # Derive the output name by swapping a .tsv extension for .csv
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    # Stream the file line by line; only one line is in memory at a time
    with open(filename) as original, open(new_filename, 'w') as new:
        for line in original:
            new.write(line.replace('\t', ','))

    return new_filename

Iterating through the lines like this loads only one line into memory at a time, instead of loading the whole file at once. It might still take a while to process 12 GB of data, though.

UPDATE: In fact, now that I think about it, it may be significantly faster to use binary I/O on such a large file, and then to replace the tabs with commas on large chunks of the file at a time. This code follows that strategy:

from io import FileIO

# This chunk size loads 1 MB at a time for conversion.
CHUNK_SIZE = 1 << 20


def tsv_to_csv_BIG(filename):
    # Derive the output name by swapping a .tsv extension for .csv
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    # Raw binary I/O: chunks are bytes, so use a byte-level translation table
    original = FileIO(filename, 'r')
    new = FileIO(new_filename, 'w')
    table = bytes.maketrans(b'\t', b',')

    while True:
        chunk = original.read(CHUNK_SIZE)
        if len(chunk) == 0:  # end of file
            break
        new.write(chunk.translate(table))

    original.close()
    new.close()
    return new_filename
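
For reference, a minimal usage sketch (the path below is a placeholder for the real 12 GB file):

# Hypothetical call; 'data.tsv' stands in for the actual file path.
csv_path = tsv_to_csv_BIG('data.tsv')
print('Wrote', csv_path)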

On my laptop, using a 1 GB TSV file, the first function takes 4 seconds to translate to CSV while the second function takes 1 second. Tuning the CHUNK_SIZE parameter might speed it up more if your storage can keep up, but 1 MB seems to be the sweet spot for me.

Using tr as mentioned in another answer took 3 seconds for me, so the chunked Python approach seems fastest.

Broseph

You can use Python's built-in read and write to rewrite the file line by line. This may take some time depending on your file size, but it shouldn't run out of memory, since you're working line by line.

with open("input.tsv", "r") as input_file:
    for line in input_file:
        with open("output.csv", "a") as output:
            line = line.replace("\t", ",")
            output.write(line)
jfaccioni
  • This answer should work, but it needlessly opens and closes the output file for every line in the input, which could happen quite a lot if the input is 12 GB large. This might slow it down a lot. It would be better to open the output file outside the for loop and use mode 'w' instead of 'a' (see the sketch after these comments). With those changes your answer essentially becomes the same as mine. – Broseph Mar 11 '19 at 02:41
  • I agree. I wrote my code trying to be as readable as possible, since OP doesn't seem to have any experience manipulating text files directly through Python. He could look into optimizing the code later if he feels the need for it, or he could leave it running for a bit if he just needs it for this particular file for now. – jfaccioni Mar 11 '19 at 02:47
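
For reference, here is a sketch of the adjustment described in the first comment above, with the output file opened once, outside the loop, in 'w' mode:

# Adjusted version per the comment: open the output a single time,
# instead of reopening it for every line of the input.
with open("input.tsv", "r") as input_file, open("output.csv", "w") as output:
    for line in input_file:
        output.write(line.replace("\t", ","))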

You can use the chunksize parameter to iterate over the entire file in pieces. Note that the example below uses .read_csv() instead of .read_table():

df = pd.DataFrame()
# Each chunk is itself a DataFrame of at most chunksize (here 1000) rows
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)

You can also try the low_memory=False flag.


Next, there is the memory_map option (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

memory_map : bool, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

Note that to_csv() has similar functionality.
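
Putting this together for the original task, a minimal sketch of a chunked TSV-to-CSV conversion might look like this (the paths and the chunk size of 100,000 rows are placeholder assumptions):

import pandas as pd

# Hypothetical paths; substitute the real 12 GB input and the desired output.
tsv_path = 'big_file.tsv'
csv_path = 'big_file.csv'

# Read the TSV in fixed-size chunks and append each chunk to the CSV,
# so the whole file is never held in memory at once.
with open(csv_path, 'w', newline='') as out:
    for i, chunk in enumerate(pd.read_csv(tsv_path, sep='\t', chunksize=100_000)):
        chunk.to_csv(out, index=False, header=(i == 0))  # write the header only once

A larger chunksize trades memory for fewer iterations.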

philshem