
I am trying to read a large txt file (1.6 GB) in order to store its content in dictionaries. I am having a hard time reading the file, and it takes very long to finish. Actually, I don't know the exact time required to run the whole code because I stopped it after 10 minutes of waiting :(.

Here is the code:

import numpy as np
import pylab as pl
import matplotlib.pyplot as plt
import fileinput
import time



def extractdata2():
    start_time = time.time()
    accel_data = { 'timestamp': [], 'sensor': [], 'x': [], 'y': [], 'z': [] }
    accel_uncalib_data = { 'timestamp': [], 'sensor': [], 'x_uncalib': [], 'y_uncalib': [], 'z_uncalib': [], 'x_bias': [], 'y_bias': [], 'z_bias': [] }
    gyro_data = { 'timestamp': [], 'sensor': [], 'x': [], 'y': [], 'z': []}
    gyro_uncalib_data = { 'timestamp': [], 'sensor': [], 'x_uncalib': [], 'y_uncalib': [], 'z_uncalib': [], 'x_drift': [], 'y_drift': [], 'z_drift': []}
    magnet_data = { 'timestamp': [], 'sensor': [], 'x': [], 'y': [], 'z': [] }
    magnet_uncalib_data = { 'timestamp': [], 'sensor': [], 'x_uncalib': [], 'y_uncalib': [], 'z_uncalib': [], 'x_bias': [], 'y_bias': [], 'z_bias': []}

    with open("accelerometer.txt") as myfile:
        for line in myfile:
            line = line.split(',')
            if "TYPE_ACCELEROMETER" in line:
                #IMU_data["accel_data"] = line  # the line must be split into 4 fields
                accel_data["timestamp"].append( line[ 0 ] )
                accel_data["sensor"].append( line[ 1 ] )
                accel_data["x"].append( line[ 2 ] )
                accel_data["y"].append( line[ 3 ] )
                accel_data["z"].append( line[ 4 ] )
                #print(accel_data)
            elif "TYPE_ACCELEROMETER_UNCALIBRATED" in line:
                accel_uncalib_data["timestamp"].append( line[ 0 ] )
                accel_uncalib_data["sensor"].append( line[ 1 ] )
                accel_uncalib_data["x_uncalib"].append( line[ 2 ] )
                accel_uncalib_data["y_uncalib"].append( line[ 3 ] )
                accel_uncalib_data["z_uncalib"].append( line[ 4 ] )
                accel_uncalib_data["x_bias"].append( line[ 5 ] )
                accel_uncalib_data["y_bias"].append( line[ 6 ] )
                accel_uncalib_data["z_bias"].append( line[ 7 ] )
                #print(accel_uncalib_data)
            elif "TYPE_GYROSCOPE" in line:
                gyro_data["timestamp"].append( line[ 0 ] )
                gyro_data["sensor"].append( line[ 1 ] )
                gyro_data["x"].append( line[ 2 ] )
                gyro_data["y"].append( line[ 3 ] )
                gyro_data["z"].append( line[ 4 ] )
                #print(gyro_data)
            elif "TYPE_GYROSCOPE_UNCALIBRATED" in line:
                gyro_uncalib_data["timestamp"].append( line[ 0 ] )
                gyro_uncalib_data["sensor"].append( line[ 1 ] )
                gyro_uncalib_data["x_uncalib"].append( line[ 2 ] )
                gyro_uncalib_data["y_uncalib"].append( line[ 3 ] )
                gyro_uncalib_data["z_uncalib"].append( line[ 4 ] )
                gyro_uncalib_data["x_drift"].append( line[ 5 ] )
                gyro_uncalib_data["y_drift"].append( line[ 6 ] )
                gyro_uncalib_data["z_drift"].append( line[ 7 ] )
                #print(gyro_uncalib_data)
            elif "TYPE_MAGNETIC_FIELD" in line:
                magnet_data["timestamp"].append( line[ 0 ] )
                magnet_data["sensor"].append( line[ 1 ] )
                magnet_data["x"].append( line[ 2 ] )
                magnet_data["y"].append( line[ 3 ] )
                magnet_data["z"].append( line[ 4 ] )
                #print(magnet_data)
            elif "TYPE_MAGNETIC_FIELD_UNCALIBRATED" in line:        
                magnet_uncalib_data["timestamp"].append( line[ 0 ] )
                magnet_uncalib_data["sensor"].append( line[ 1 ] )
                magnet_uncalib_data["x_uncalib"].append( line[ 2 ] )
                magnet_uncalib_data["y_uncalib"].append( line[ 3 ] )
                magnet_uncalib_data["z_uncalib"].append( line[ 4 ] )
                magnet_uncalib_data["x_bias"].append( line[ 5 ] )
                magnet_uncalib_data["y_bias"].append( line[ 6 ] )
                magnet_uncalib_data["z_bias"].append( line[ 7 ] )
                #print(magnet_uncalib_data)

    print("--- %s seconds ---" % (time.time() - start_time))

    return accel_data, accel_uncalib_data, gyro_data, gyro_uncalib_data, magnet_data, magnet_uncalib_data

How can I speed up my routine? I have tried many of the tips mentioned on Stack Overflow for similar cases, but they didn't work.

Many thanks in advance! :)

  • This may be better served at [code review stack exchange](https://codereview.stackexchange.com/). But I can't help but wonder if pyspark would better process such a large file for you. – Matt Cremeens May 26 '17 at 18:19
  • Your code might have broken due to a memory issue; in that case, try to take a chunk of the data at a time to process, e.g. using sampling techniques. – Amey Yadav May 26 '17 at 18:19
  • I'm wondering if pandas would be a better option for you. – mauve May 26 '17 at 18:26
  • @mauve almost certainly [`pandas`](http://pandas.pydata.org/) is the solution; I agree with you. – roganjosh May 26 '17 at 18:36
  • You should check https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python and https://stackoverflow.com/questions/14944183/python-fastest-way-to-read-a-large-text-file-several-gb – Jose Raul Barreras May 26 '17 at 19:34
  • @JoseRaulBarreras already tried the solutions from the second post you listed here and it didn't work. I will check the other one. – Florin-Catalin Grec May 26 '17 at 21:40
  • @mauve used pandas and it read the file in 30 seconds. I tried this: `data = pd.read_csv('accelerometer.txt', sep=',', header=None)` followed by `data.columns = ["timestamps", "type", "x", "y", "z"]` – Florin-Catalin Grec May 26 '17 at 22:48
  • That sounds like a time improvement to me! :) – mauve May 30 '17 at 17:17
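
For reference, a minimal sketch expanding the pandas approach from the comment above. The eight column names are an assumption so that both the 5-field calibrated rows and the 8-field uncalibrated rows parse (shorter rows are padded with NaN); adjust the names and the filter to your actual file format:

import pandas as pd

# Assumed layout: calibrated rows have 5 fields, uncalibrated rows add 3 bias/drift
# fields; naming 8 columns lets both row lengths parse (missing fields become NaN).
cols = ["timestamp", "sensor", "x", "y", "z", "x_extra", "y_extra", "z_extra"]
data = pd.read_csv("accelerometer.txt", sep=',', header=None, names=cols)

# One frame per sensor type, e.g. the calibrated accelerometer samples
accel_data = data[data["sensor"] == "TYPE_ACCELEROMETER"]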

1 Answer


As Amey Yadav has already suggested, reading the text file in chunks is a rather efficient solution. To do that, I can think of two approaches.

First off, I would suggest writing a generator that processes the text in chunks, as sketched below. You can then process each chunk any way you want.
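
A minimal sketch of such a generator (the helper name and chunk size are illustrative, and the body of the inner loop is just a placeholder for your own dispatch logic):

from itertools import islice

def read_in_chunks(file_object, lines_per_chunk=100000):
    """Yield lists of up to lines_per_chunk lines, so the whole file is never held in memory at once."""
    while True:
        chunk = list(islice(file_object, lines_per_chunk))
        if not chunk:
            break
        yield chunk

with open("accelerometer.txt") as myfile:
    for chunk in read_in_chunks(myfile):
        for line in chunk:
            fields = line.split(',')
            # ...dispatch on the sensor type, as in extractdata2()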

Secondly, a rather handy Python library that helps with large corpora is Gensim. Going through its tutorials, you'll find that it's pretty easy to load documents into its topic modelling software in a way that doesn't require loading the whole file into memory, which takes vastly less time to process your data.
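
As a rough illustration of the streaming pattern from Gensim's tutorials (the corpus file name and the whitespace tokenisation are assumptions, not part of the question):

from gensim import corpora

class StreamedCorpus:
    """Stream one document (here: one line of the file) at a time from disk."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                # each line is converted to a bag-of-words vector lazily
                yield self.dictionary.doc2bow(line.lower().split())

# build the vocabulary in one streaming pass, then wrap the corpus
dictionary = corpora.Dictionary(line.lower().split() for line in open("mycorpus.txt"))
corpus = StreamedCorpus("mycorpus.txt", dictionary)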

kingJulian