
I need to scan two large txt files (both about 100GB, 1 billion rows, several columns) and extract a certain column (writing it to new files). The files look like this:

ID*DATE*provider
1111*201101*1234
1234*201402*5678
3214*201003*9012
...

My Python script is:

N100 = 10000000   ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(f):   ## <== MemoryError
            c0, c1, c2  = line.split("*")
            f2.write(c2+"\n")
            if ind%N100 == 0: 
                print(perc, "%")
                perc+=1

Now the above script runs well for one file but gets stuck at 62% for the other one. The error message says MemoryError for the line `for ind, line in enumerate(f):`. I tried several times on different servers with different amounts of RAM, and the error is the same, always at 62%. I monitored the RAM for hours and it exploded to 28GB (out of 32GB total) at 62%. So I guess that file contains a line that is somehow too long (maybe not ended with \n?), and Python gets stuck trying to read it into RAM.

So my question is: before I go to my data provider, what can I do to detect the error line and somehow get around it / skip reading it as one huge line? I'd appreciate any suggestions!

EDIT:

Starting from the 'error line', the rest of the file might be all run together, using some other line separator rather than \n. If that's the case, can I detect the line separator and continue extracting the columns I want, rather than throwing those rows away? Thanks!
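
For example, would something like this (untested; the offset and window size are just guesses) be a reasonable way to peek at the raw bytes around the 62% point and see which byte is actually separating the records there?

## Untested: count candidate separator bytes in a 1 MB window around the
## point where the script died (~62% of the file); offsets are just a guess.
import collections, os

path = "myFile.txt"
offset = int(os.path.getsize(path) * 0.62)

with open(path, "rb") as f:
    f.seek(offset)
    data = f.read(1024 * 1024)

counts = collections.Counter(data)
for sep in (b"\n", b"\r", b"\x00", b"*"):
    print(sep, counts[sep[0]])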

Jason Lou
  • If you monitor the process for the first 62%, does memory use grow steadily? You might have a memory leak unrelated to line parsing. – Ry- Nov 08 '17 at 19:41
  • You can use the `readline` method on the file object, which accepts a maximum line length. – Asad Saeeduddin Nov 08 '17 at 19:41
  • Actually no, it's all fine until somewhere near 62%. – Jason Lou Nov 08 '17 at 19:42
  • Then drop the "by line" approach and read character by character, counting the "*" and the linefeeds. – Jean-François Fabre Nov 08 '17 at 19:46
  • @Jean-FrançoisFabre or some reasonable "chunk". – juanpa.arrivillaga Nov 08 '17 at 19:47
  • I vote for the yield approach, like here: https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – Harry Nov 08 '17 at 19:50
  • You don't need to drop to processing characters; as I said before, you can use `readline` to keep using lines but truncate long lines to a suitable length. – Asad Saeeduddin Nov 08 '17 at 19:50
  • @JasonLou To use a different character as a newline you may as well start processing the raw character stream, but perhaps you can repair your data by preprocessing the file and replacing the messed up character with `\n`? – Asad Saeeduddin Nov 08 '17 at 20:52
  • @AsadSaeeduddin what do you mean by "replacing the messed up character with `\n` "? – Jason Lou Nov 08 '17 at 22:16
  • @JasonLou From your edit, you say that there is a line where the newlines are missing, and are substituted with some other type of line separator character. Couldn't you just replace this non-standard line separator character with `\n`? – Asad Saeeduddin Nov 08 '17 at 22:29
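
Following the preprocessing suggestion in the comments, here is a minimal untested sketch: stream the file in binary chunks and rewrite the stray separator as \n, writing a cleaned copy to a new file. The b"\r" below is only a placeholder for whatever byte the bad separator turns out to be, and a simple replace like this only works if that separator is a single byte.

## Untested sketch: rewrite a stray single-byte separator as b"\n" while
## streaming the file, so the cleaned copy can be processed line by line.
BAD_SEP = b"\r"            ## placeholder -- substitute the real separator
CHUNK = 64 * 1024 * 1024   ## 64 MB per read keeps memory bounded

with open("myFile.txt", "rb") as src, open("myFile_fixed.txt", "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block.replace(BAD_SEP, b"\n"))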

2 Answers


This (untested) code might solve your problem. It limits its input to 1,000,000 characters per read, to reduce its maximum memory consumption.

Note that this code returns the first million characters from each line. There are other possibilities for how to deal with a long line:

  • return the first million characters
  • return the last million characters
  • skip the line entirely, optionally logging that (a sketch of this variant follows the code below), or
  • raise an exception.

 

#UNTESTED
def read_start_of_line(fp):
    # Return at most the first 1,000,000 characters of the next line. If the
    # line is longer than that, the rest of it is read and thrown away, so the
    # next call starts at the following line. Returns '' at end of file.
    n = int(1e6)
    tmp = result = fp.readline(n)
    while tmp and tmp[-1] != '\n':   # keep reading until the real end of line...
        tmp = fp.readline(n)         # ...discarding the extra data
    return result

N100 = 10000000   ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(iter(lambda: read_start_of_line(f), '')):
            c0, c1, c2  = line.split("*")
            f2.write(c2+"\n")
            if ind%N100 == 0:
                print(perc, "%")
                perc+=1
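
And here is an equally untested sketch of the "skip the line entirely" option from the list above: if a line does not fit in one read, drain the rest of it and signal the caller to skip it (returning None is just one way to do that).

#UNTESTED -- "skip the line entirely" variant
def read_line_or_skip(fp, limit=int(1e6)):
    line = fp.readline(limit)
    if not line or line.endswith('\n'):
        return line                          # complete line, or '' at end of file
    tmp = line
    while tmp and not tmp.endswith('\n'):
        tmp = fp.readline(limit)             # drain and discard the rest of the long line
    print("skipping an over-long line")      # optional logging
    return None                              # tell the caller to skip this one

With this variant the caller needs an explicit check such as `if line is None: continue` inside the loop, since the `iter(..., '')` sentinel only detects end of file.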
Robᵩ
  • Can you explain what `while tmp and tmp[-1] != '\n':` does? It seems the 1,000,000 bytes will cut good lines into pieces; then how can I get the column I want? Thanks – Jason Lou Nov 08 '17 at 20:10
  • What if c2 is at the end of the line? Maybe keeping the _end_ of the line would be better. It depends on what's in the OP's "big line". – Jean-François Fabre Nov 08 '17 at 20:10
  • @JasonLou - That expression controls the continuation of the `while` loop. It keeps reading, and throwing away data, until it finally reads a newline, i.e. until it finally reads the entire long line. – Robᵩ Nov 08 '17 at 20:16
  • @Jean-FrançoisFabre - You are right. My presumption is that `c2` is within the first 1,000,000 characters. I'll document that. – Robᵩ Nov 08 '17 at 20:17
  • Since the line is split into 3 parts (unpacking requires that), if c2 is within the first 1,000,000 chars then it's truncated by your method. If it's in the last 1,000,000 chars there's a chance that c0 and/or c1 are truncated, but not c2 (else: unpack error). I would have read block by block and split manually. But that's a big effort to write such code. Interesting question, but tedious work answering it. – Jean-François Fabre Nov 08 '17 at 20:18
  • @Jean-FrançoisFabre - You are right, returning the last million is guaranteed to be wrong. I wasn't looking at OP's use case, just thinking generally I'd want to throw away the trailing, not leading data. Hopefully OP can build on what I've got. – Robᵩ Nov 08 '17 at 20:21
  • You cannot know what's in the line, so your answer is as good as it can be. – Jean-François Fabre Nov 08 '17 at 20:23
  • @Robᵩ Yes, yours is the best answer so far. Can you see my new EDIT in the question? – Jason Lou Nov 08 '17 at 20:34
  • @JasonLou - I can't think of any way to recognize a new line separator on the fly. At that point, I think I'd open the file in my favorite text editor and try to fix the file manually. – Robᵩ Nov 08 '17 at 20:39
  • @Robᵩ Yes, and that's why I'm contacting the data provider now... Thanks anyway. – Jason Lou Nov 08 '17 at 20:51

Specifying a maximum chunk size solves the problem of overflowing memory, while still allowing you to process the entire file. The following generator functions should help you do it:

def chunks(f, bufsize):
  # Yield the pieces of a single line, each at most `bufsize` characters long,
  # stopping after the piece that ends with a newline (or at end of file).
  while True:
    chunk = f.readline(bufsize)
    if not chunk:
      break
    yield chunk
    if chunk[-1] == "\n":
      break

def lines(path, bufsize):
  # Yield one `chunks` generator per line. Any pieces the caller leaves
  # unconsumed are drained here so the file position moves on to the next line.
  with open(path) as f:
    pos = -1
    while f.tell() > pos:
      pos = f.tell()
      c = chunks(f, bufsize)
      yield c
      for _ in c:
        pass

Here is an example of how to read only the first 20 characters from each line:

import itertools

for i, line in enumerate(lines("./core/scrape.js", 10)):
  print(i, end=": ")
  print(''.join(itertools.islice(line, 2)).rstrip())

Output looks something like:

0: /**
1:  * Document scraper/
2:  *
3:  * @author Branden H
4:  * @license MIT
5:  *
6:  */
7:
8: var promise = requir
9: var fs = promise.pro
10: var _ = require("lod
11: var util = require("
12: const path = require
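
Applied to the question's files, an untested sketch along the same lines would reassemble at most the first couple of chunks of each line and pull out the third '*'-separated field, so even a monster line only ever costs a couple of megabytes of memory (this reuses the lines() helper above; the buffer size and chunk count are arbitrary):

import itertools

# untested: reuse lines()/chunks() from above to extract the third column,
# reading at most ~2 MB of any single line
with open("myFile_c2.txt", "a") as out:
  for line_chunks in lines("myFile.txt", 1000000):
    head = ''.join(itertools.islice(line_chunks, 2))  # keep at most two chunks
    fields = head.rstrip("\n").split("*")
    if len(fields) == 3:                              # skip malformed/truncated lines
      out.write(fields[2] + "\n")

Note that a truncated field could still slip through if the kept chunks of a bad line happen to contain exactly two '*' characters, so this is only as trustworthy as the data around the damaged region.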
Asad Saeeduddin