
I have a pretty large file (about 8 GB). I have already read this post: How to read a large file line by line, and this one: Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors.

But this still doesn't do the job. When I run my code, my PC gets stuck. Am I doing something wrong?

I want to get all words into a list (i.e. tokenize them). Also, doesn't the code read each line and tokenize it line by line? Might that prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end after just one line?

I considered splitting it up into smaller files, but wouldn't that still consume my RAM, given that I only have 8 GB of RAM and the list of words will probably be about as big (8 GB) as the initial txt file?

import os
import nltk

word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)
Felix

1 Answer


By using the following line:

for line in t.readlines():
    # do the things

You are forcing Python to read the whole file with t.readlines(), which returns a list of strings representing the whole file, thus bringing the entire file into memory at once.

Instead, if you iterate over the file object directly, as the example you linked shows:

for line in t:
    # do the things

Python will process the file line by line, as you want: the file object acts like a generator, yielding one line at a time.


After looking at your code again, I see that you are constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you do read the file one line at a time, you still retain all of that data in memory after each line has been processed. You will likely need to find a better way of doing whatever this is, because otherwise you will still consume massive amounts of memory: the tokens are never dropped from memory.


For data this large, you will have to either

  • find a way to store an intermediate version of your tokenized data (a sketch of this follows below), or
  • design your code in a way that you can handle one or just a few tokenized words at a time.
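
For the first option, a minimal sketch could simply write one token per line to an intermediate file (tokens.txt is just an example name, and save_path is reused from your question):

import os
import nltk

def write_tokens(input_path, output_path):
    # Stream the input line by line and append the tokens to the output
    # file, so only one line's worth of tokens is in memory at any time.
    with open(input_path, 'r', encoding="utf-8") as src, \
         open(output_path, 'w', encoding="utf-8") as dst:
        for line in src:
            for word in nltk.word_tokenize(line):
                dst.write(word + "\n")

write_tokens(os.path.join(save_path, 'alldata.txt'),
             os.path.join(save_path, 'tokens.txt'))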

For the second option, something like this might do the trick:

import os
import nltk

def enumerated_tokens(filepath):
    # Yield (index, word) pairs one at a time instead of building a list.
    index = 0
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1

for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.

Notice how this never actually stores the words anywhere. That doesn't mean you can't temporarily store anything, but if you're memory-constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
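
For example, if the "thing" you ultimately need is something like word frequencies (just one possible use, not necessarily yours), you can consume the generator with a running collections.Counter and never hold the full word list:

from collections import Counter

word_counts = Counter()
for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    # Only one (index, word) pair is live at a time; the Counter keeps a
    # single entry per distinct word rather than every occurrence.
    word_counts[word] += 1

print(word_counts.most_common(10))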

David Culbreth
  • OK, I see: for files that large, a list of words cannot be held in the PC's memory. I need to write the words to an extra file and pass them through a generator. So for further analysis, I need to reload the file, iterate over it (again, if not done the first time) and store the result to (another) file. In case I need to enumerate each sentence (and a sentence is unlikely to end exactly at a line break), how do I read the file until a dot (.) or something similar occurs? (see the sketch after these comments) – Felix Jul 17 '19 at 20:35
  • 1
    The approach I suggested above still only uses the original file, and makes a generator from the lines of the original file, which you can use to process each word independently. However, if you need to have all of the words in a list all at the same time, you will likely need to make a file that is the (more easily parse-able) list of those words, before you do the rest of your processing. – David Culbreth Jul 17 '19 at 20:38
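
Regarding the follow-up question about sentences that do not end at a line break: one possible approach (just a sketch; stream_sentences is a hypothetical helper, and it assumes the nltk.sent_tokenize punkt model is available, like word_tokenize) is to keep a small text buffer and only yield the sentences the tokenizer considers complete, carrying the unfinished remainder over to the next line:

import os
import nltk

def stream_sentences(filepath):
    # Keep a small buffer of raw text. Whenever nltk.sent_tokenize finds
    # more than one sentence in it, everything except the last (possibly
    # still unfinished) sentence is complete and can be yielded.
    buffer = ""
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            buffer += line
            parts = nltk.sent_tokenize(buffer)
            for sentence in parts[:-1]:
                yield sentence
            # Re-add a line break so words on the next line stay separated.
            buffer = parts[-1] + "\n" if parts else ""
    if buffer.strip():
        yield buffer.strip()

for number, sentence in enumerate(stream_sentences(os.path.join(save_path, 'alldata.txt'))):
    print(number, sentence)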