
I have two CSV files (file1 and file2).

To simplify my question, let's say that I want to read file1 three lines at a time and file2 four lines at a time, alternating between them.

file1.csv (9 lines)

1,2,3
3,5,8
7,2,9
10,111,12
13,14,155
31,2,3
3,15,82
8,4,91
12,111,13

file2.csv (12 lines)

55,12,17
3,6,13
72,1,91
10,0,12
1,1,73
31,2,3
3,15,61
18,6,91
13,33,13
7,1,15
9,17,42
41,8,18

In the output I want to get:

1,2,3 (from 1. row of file1.csv)
3,5,8 (from 2. row of file1.csv)
7,2,9 (from 3. row of file1.csv)
55,12,17  (from 1. row of file2.csv)
3,6,13  (from 2. row of file2.csv)
72,1,91  (from 3. row of file2.csv)
10,0,12  (from 4. row of file2.csv)
10,111,12  (from 4. row of file1.csv)
13,14,155  (from 5. row of file1.csv)
31,2,3  (from 6. row of file1.csv)
1,1,73  (from 5. row of file2.csv)
31,2,3  (from 6. row of file2.csv)
3,15,61  (from 7. row of file2.csv)
18,6,91  (from 8. row of file2.csv)
3,15,82  (from 7. row of file1.csv)
8,4,91  (from 8. row of file1.csv)
12,111,13  (from 9. row of file1.csv)
13,33,13  (from 9. row of file2.csv)
7,1,15  (from 10. row of file2.csv)
9,17,42  (from 11. row of file2.csv)
41,8,18  (from 12. row of file2.csv)

My real data files are very big (~1.6 GB each) and I want to use as little memory as possible. For this, I wrote a script:

f1, f2, = open(pathInput1, 'r'), open(pathInput2, 'r')
position1, position2 = 0, 0

for i in range(6):
    if i%2 == 0:
        #print("file1.csv")
        sizeOfWindow = 3
        sizeOfWindowInactive = 4
        f1.seek(position1)
        data = []
        for l in range(sizeOfWindow):
            line = f1.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f2) for i in range(sizeOfWindowInactive)]
        position1 = f1.tell()
    else:
        #print("file2.csv")
        sizeOfWindow = 4
        sizeOfWindowInactive = 3
        f2.seek(position2)
        data = []
        for l in range(sizeOfWindow):
            line = f2.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f1) for i in range(sizeOfWindowInactive)]
        position2 = f2.tell()

After writing this script, I noticed that I can't mix readline() and next() on the same file. My question is: how can I rearrange my script to produce the same output without using much memory?

Edit: In my real case, I have 5 files, and each file has its own fixed sizeOfWindow. I don't read the files in a regular order: an if statement on the last chunk of data I read decides which file to jump to next. When I read a file, I need to move the cursor of the other files without reading their data.

Mas A
  • Just to keep it complicated, you can't use `seek` and `tell` reliably on a non-binary file (the string decoder gets in the way). If you just want to interleave lines, you don't need this level of complication. – tdelaney May 03 '18 at 16:48
  • I don't see the problem. Each file descriptor has its own "bookmark". You read 3 lines from file1, then read 4 lines from file2. You're reading each file in order: there's no need for `seek, next,` or `tell`. Repeat until you run out of data in the files. – Prune May 03 '18 at 16:51
  • Throw away the seeks and peeks: just read 3 lines here, 4 lines there, and repeat. Maybe even better (ask the NumPy experts): read both into different ndarrays and interleave them there. I have no idea how good that is memory-wise, but if all you want to do is print them to the console, you don't need 80% of your code, since it doesn't matter whether you print them as strings (as read from the file) or parse them into lists before printing ... – Patrick Artner May 03 '18 at 16:53
  • @tdelaney I found the approach here, where they use it: https://stackoverflow.com/questions/15594817/f-seek-and-f-tell-to-read-each-line-of-text-file – Mas A May 03 '18 at 17:22
  • @Prune No, using next and readline together is a problem. "Combining next() method with other file methods like readline() does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer." https://www.tutorialspoint.com/python/file_next.htm – Mas A May 03 '18 at 17:25
  • @Patrick Artner Thank you, but it is more complicated than that. – Mas A May 03 '18 at 17:25
  • @SmA: I missed one: you throw away either `next` or `readline`, as well. – Prune May 03 '18 at 17:53
  • You say "When I read a file, I need to move the cursor of other files without reading their data." Why? Your posted desired output clearly reads all of the data in both files, in order. – Prune May 03 '18 at 18:11

1 Answer

Since you only need to read the files sequentially, you can use next(f1) and next(f2) as needed to get the lines you want. The itertools module contains helpers that make this easier. itertools.islice will grab several lines so you don't need your own loop for next. And itertools.cycle will alternate items in a list so you don't need to track which file is next. Putting it together:

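import itertools
import numpy as np

# pathInput1 and pathInput2 are the paths to the two CSV files, as in the question
with open(pathInput1) as f1, open(pathInput2) as f2:
    grab_this = ((3, f1), (4, f2))  # (lines per turn, file) pairs
    for num, fp in itertools.cycle(grab_this):
        # islice pulls at most `num` lines; parse each one into a list of ints
        rows = [list(map(int, line.split(","))) for line in itertools.islice(fp, num)]
        if not rows:  # this file is exhausted
            break
        data = np.array(rows)
        print(data)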
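
For the case described in the question's edit (several files, each with its own window size, with the next file chosen from the data just read), the same idea can probably be extended with two small helpers. The following is only a minimal sketch: `read_window` and `skip_lines` are hypothetical names, `io.StringIO` objects stand in for the real files, and it assumes every line holds comma-separated integers with no blank lines. Only the current window is held in memory, and skipped lines are discarded as they are read.

```python
import io
import itertools
from collections import deque


def read_window(fp, n):
    """Read up to n lines from fp and parse each into a list of ints."""
    return [list(map(int, line.split(","))) for line in itertools.islice(fp, n)]


def skip_lines(fp, n):
    """Advance fp by up to n lines without parsing or keeping them."""
    # A deque with maxlen=0 consumes the iterator and stores nothing.
    deque(itertools.islice(fp, n), maxlen=0)


# In-memory stand-ins for the real files (assumption for this sketch).
f1 = io.StringIO("1,2,3\n3,5,8\n7,2,9\n10,111,12\n")
f2 = io.StringIO("55,12,17\n3,6,13\n72,1,91\n10,0,12\n1,1,73\n")

first = read_window(f1, 3)   # parse the first 3 rows of file1
skip_lines(f2, 4)            # move file2's cursor past 4 rows, unparsed
after_skip = read_window(f2, 4)  # only 1 row is left in file2
print(first)
print(after_skip)
```

Because both helpers go through the file's iterator, neither `readline()` nor `seek()`/`tell()` is needed, so the read-ahead buffer problem does not come up. Each row list can be passed to `np.array` if an array is needed.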
tdelaney
  • Can you please read the edit at the top of my question? Thank you, but I already tried `islice()` and it uses a lot of memory for me. Do you have another suggestion? – Mas A May 03 '18 at 17:31
  • When I tried your code, it gave me an error on `data = np.array(itertools.islice(fp, num))`: `ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.` (I've used `islice()` like this: `for line in islice(f1, sizeOfWindowInactive, None): pass` to skip unnecessary lines. As I said, it took more memory, and I don't know why.) – Mas A May 03 '18 at 17:43
  • My mistake. That should have been `itertools.cycle`. The code was doing the equivalent of `islice((4, f2), (3, f1))` and that second parameter is definitely not an integer! `islice(f1, sizeOfWindowInactive, None)` isn't what you want either - it consumes the entire iterator. I changed the code and it should work much better. – tdelaney May 03 '18 at 23:34
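
The difference between the two `islice` forms mentioned in the last comment can be checked on a small in-memory file. This is only an illustrative sketch: `io.StringIO` stands in for a real file, and the variable names are made up for the example.

```python
import io
from itertools import islice

text = "a\nb\nc\nd\ne\n"

# islice(fp, 2): yields the first 2 lines and stops; the rest stays unread.
fp = io.StringIO(text)
taken = list(islice(fp, 2))
next_after_take = fp.readline()

# islice(fp, 2, None): skips 2 lines, then yields every remaining line,
# so iterating over it runs to the end of the file.
fp = io.StringIO(text)
tail = list(islice(fp, 2, None))
next_after_tail = fp.readline()

print(taken, repr(next_after_take))
print(tail, repr(next_after_tail))
```

So to skip a fixed number of lines, the stop-only form `islice(fp, n)` is the one to iterate over; the `(n, None)` form reads everything after the skipped lines as well.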