
I have a lot of .txt files which together form a dataset that is too large to load into a variable (i.e. there isn't enough memory to read all the files into a single pandas DataFrame). Can I somehow get descriptive statistics by just reading through the files, without loading them into a DataFrame/variable? How? Thank you!

lte__
    You can iterate over the files and append the result of `df.describe()` for each file. That way you wouldn't need to load each file and keep them in memory – EdChum Oct 14 '16 at 12:44
    Use [online statistical algorithms](http://stackoverflow.com/q/1058813/190597). – unutbu Oct 14 '16 at 12:54
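The suggestions in the comments can be sketched as code. Below is a minimal, stdlib-only example of an online (streaming) algorithm, here Welford's method for the running mean and standard deviation, assuming the .txt files are comma-separated with a header row; the function name `online_stats` and the column argument are illustrative, not from the original post:

```python
import csv
import glob
import math

def online_stats(filenames, column):
    """Welford's online algorithm: running count, mean and standard
    deviation for one numeric column, reading line by line so only
    a single row is ever held in memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for filename in filenames:
        with open(filename, newline='') as f:
            for row in csv.DictReader(f):
                x = float(row[column])
                n += 1
                delta = x - mean
                mean += delta / n
                m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else float('nan')
    return n, mean, std

# e.g. online_stats(glob.glob('*.txt'), 'some_column')
```

Because the state is just three numbers per column, this scales to any number of files regardless of their total size.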

1 Answer


To get basic information, you can select the files with glob and open them as plain text. Assuming each is a CSV file with column titles on the first line, you can retrieve the column names by splitting that first line. Then, based on How to get line count cheaply in Python?, count the remaining lines.

import glob

filenames = glob.glob('*.txt')
for filename in filenames:
    with open(filename) as f:
        # First line: column names
        keys = f.readline().rstrip().split(',')
        # Count the remaining (data) lines without storing them
        count = sum(1 for _ in f)
    print("File:", filename, " keys:", keys, " len:", count)
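Building on the scan above, here is a sketch that also tracks the minimum and maximum of each column while reading, assuming the files are comma-separated with a header row and purely numeric data (the name `column_ranges` is illustrative, not part of the original answer):

```python
import glob

def column_ranges(pattern='*.txt'):
    """Scan all matching files once, keeping only a [min, max]
    pair per column name in memory."""
    ranges = {}  # column name -> [min, max]
    for filename in glob.glob(pattern):
        with open(filename) as f:
            keys = f.readline().rstrip().split(',')
            for line in f:
                for key, value in zip(keys, line.rstrip().split(',')):
                    x = float(value)
                    lo_hi = ranges.setdefault(key, [x, x])
                    lo_hi[0] = min(lo_hi[0], x)
                    lo_hi[1] = max(lo_hi[1], x)
    return ranges
```

The same pattern extends to sums and counts (and hence means); only a few numbers per column are kept, so memory use stays constant no matter how many files match.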
Nico7as