
I have a lot of files (300–500) to read, and I want to speed this task up.

Ideally, it would look something like this:

from multiprocessing import Pool
import os
import _io

filelist = map(open, os.listdir())
if __name__ == '__main__':
    with Pool() as pool:
        a = pool.map(_io.TextIOWrapper.read, filelist)

Of course, I got an error:

TypeError: cannot serialize '_io.TextIOWrapper' object

The question is: can I speed up the I/O with parallelism? If so, how?

UPDATE (conclusion):

I found a way to parallelize it and benchmarked the code below on 22 files totalling 63.2 MB:

from multiprocessing import Pool
import os

def my_read(file_name):
    with open(file_name) as f:
        return f.read()

def mul():
    # parallel version: read every file in a worker process
    with Pool() as pool:
        a = pool.map(my_read, os.listdir())

def single():
    # sequential version for comparison
    a = []
    for i in os.listdir():
        with open(i) as f:
            r = f.read()
        a.append(r)

if __name__ == '__main__':
    mul()
    # single()

Sadly, single() takes 0.4 s while mul() takes 0.8 s.
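
The timings are plain wall-clock measurements; a minimal harness (just a sketch, not the exact script I ran, replacing the __main__ block above) looks like this:

import time

def timed(func):
    # run func once and print the elapsed wall-clock time
    start = time.perf_counter()
    func()
    print(func.__name__, time.perf_counter() - start, "s")

if __name__ == '__main__':
    timed(single)
    timed(mul)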


UPDATE 1:

Some people said it is an I/O-bound task, so it cannot be improved by parallelism. However, I found this in the Python docs:

However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
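
For comparison, a thread-based version of my benchmark would look roughly like this (just a sketch using concurrent.futures; I have not timed it):

from concurrent.futures import ThreadPoolExecutor
import os

def my_read(file_name):
    with open(file_name) as f:
        return f.read()

if __name__ == '__main__':
    # threads share memory, so the results don't have to be pickled,
    # and the GIL is released while read() waits on the disk
    with ThreadPoolExecutor(max_workers=4) as executor:
        a = list(executor.map(my_read, os.listdir()))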

Here is the full code. My goal is to convert an EPUB file to plain text.

I have already parallelized char2text and now I want to speed up readall:

import zipfile
from multiprocessing import Pool

import bs4


def char2text(i):
    soup = bs4.BeautifulSoup(i)
    chapter = soup.body.getText().splitlines()
    chapter = "\n".join(chapter).strip() + "\n\n"
    return chapter


class Epub(zipfile.ZipFile):
    def __init__(self, file, mode='r', compression=0, allowZip64=False):
        zipfile.ZipFile.__init__(self, file, mode, compression, allowZip64)
        if mode == 'r':
            self.opf = self.read('OEBPS/content.opf').decode()
            opf_soup = bs4.BeautifulSoup(self.opf)
            self.author = opf_soup.find(name='dc:creator').getText()
            self.title = opf_soup.find(name='dc:title').getText()
            try:
                self.description = opf_soup.find(name='dc:description').getText()
            except AttributeError:
                # the OPF has no dc:description tag
                self.description = ''
            try:
                self.chrpattern = opf_soup.find(name='dc:chrpattern').getText()
            except AttributeError:
                # the OPF has no dc:chrpattern tag
                self.chrpattern = ''
            self.cover = self.read('OEBPS/images/cover.jpg')
        elif mode == 'w':
            pass

    def get_text(self):
        self.tempread = ""
        charlist = self.readall(self.namelist())
        with Pool() as pool:
            txtlist = pool.map(char2text, charlist)
        self.tempread = "".join(txtlist)
        return self.tempread

    def readall(self, namelist):
        charlist = []
        for i in namelist:
            if i.startswith('OEBPS/') and i.endswith('.xhtml'):
                r = self.read(i).decode()
                charlist.append(r)
        return charlist

    def epub2txt(self):
        tempread = self.get_text()
        with open(self.title + '.txt', 'w', encoding='utf8') as f:
            f.write(tempread)


if __name__ == "__main__":
    e = Epub("assz.epub")
    import cProfile
    cProfile.run("e.epub2txt()")
PaleNeutron
  • The bottleneck of file I/O is typically the disk, not the CPU, so simply making it parallel may not bring any speed-up (it may even make it slower sometimes). – starrify Oct 10 '14 at 07:35
  • @starrify I know that, but some people said that using multiple threads can improve a Python program limited by I/O because of the OS's cache. – PaleNeutron Oct 10 '14 at 07:37
  • @starrify I have added an update to my question. – PaleNeutron Oct 10 '14 at 07:55
  • @starrify: [if the files are in cache then multithreading *can* speed things up](http://stackoverflow.com/questions/25606833/fastest-way-to-sum-integers-in-text-file#comment40064167_25606833) – jfs Oct 10 '14 at 09:22
  • @J.F.Sebastian In that case, the multithreading is just speeding up the CPU-bound work, and the file cache is just making I/O fast enough that it doesn't dominate the CPU work, right? Or does the disk cache actually allow multi-threaded access? – dano Oct 10 '14 at 14:57
  • @dano: yes. multithreading is used only for CPU-bound work in the example (it is said in the description and the code is simple itself). But if file caches were cold then it would be pointless to use multithreading here because the disk is too slow compared to CPU – jfs Oct 10 '14 at 18:37
  • @J.F.Sebastian So in the OP's case, parallelizing the `read` calls for all the different files he wants to open isn't really going to improve performance (and may hurt it) unless all the files are already in the cache, right? And to the OP: when folks say threading is useful for "multiple I/O-bound tasks", they mean multiple tasks which *block* on I/O, like network calls or database queries. Reading from the local disk doesn't block, and also can't be done in parallel (putting aside the possibility of the file being in the disk cache), so threading normally doesn't speed it up. – dano Oct 10 '14 at 19:43
  • @dano: correct, if the task is I/O-bound. If each file requires many CPU cycles to process, then processing them in parallel on multiple CPUs may improve performance. – jfs Oct 10 '14 at 19:54
  • Disks have latency - to this extent, parallel I/O can be faster. Too much parallelism will make intertrack seeks dominate intratrack seeks, and intertrack seeks are much slower than intratrack. – user1277476 Oct 11 '14 at 00:59
  • In this particular case multiprocessing buys you nothing over threads, since the GIL isn't held during a read() call. – Charles Duffy Oct 11 '14 at 01:21
  • @user1277476: I just used 4 processes, not "too much", but got no improvement, as you can see in my question. So how can I make it "faster"? – PaleNeutron Oct 11 '14 at 01:23
  • @CharlesDuffy: the `read` function was called via multiprocessing, where there should be no GIL, right? – PaleNeutron Oct 11 '14 at 01:25
  • @PaleNeutron, my point is that even if you _weren't_ using multiprocessing, the GIL still wouldn't be a problem, since it's released during IO, so you're using multiprocessing without need. So -- you're right that the GIL isn't a problem here, but it wouldn't have been a problem otherwise either. – Charles Duffy Oct 11 '14 at 01:28
  • @CharlesDuffy: Thanks, I knew that from the very beginning; my point is just "can parallelism improve I/O", not "release the GIL". In this particular case, using multiple threads would be the same. – PaleNeutron Oct 11 '14 at 01:36
  • @dano: I thought reading a zipfile (which is much slower than reading a plain file) is not an I/O-bound case, but I have no idea how to parallelize it. – PaleNeutron Oct 11 '14 at 02:12

1 Answer


Did you try something like:

from multiprocessing import Pool
import os
import _io

def my_read(file_name):
    with open(file_name) as f:
        return _io.TextIOWrapper.read(f)


if __name__ == '__main__':
    with Pool() as pool:
        a = pool.map(my_read, os.listdir('some_dir'))

It sounds more logical to me to open/close the file in the subprocess, and strings are easily serializable.

For your readall method, try:

def readall(self, namelist):
    filter_func = lambda i: i.startswith('OEBPS/') and i.endswith('.xhtml')
    read_fun = lambda i: self.read(i).decode()

    with Pool() as pool:
        a = pool.map(read_fun, filter(filter_func, namelist))
    return a
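
Note that Pool.map has to pickle the callable it is given, and neither lambdas nor the bound self.read (the ZipFile holds an open file object) can be pickled, so the snippet above may raise a PicklingError. A variant that avoids this by reopening the archive inside each worker could look like this (just a sketch; it assumes the Epub object keeps the path it was opened from in a self.path attribute, which is not in your original code):

import zipfile
from multiprocessing import Pool

def read_member(args):
    # worker: open the archive independently and read/decode one member
    epub_path, name = args
    with zipfile.ZipFile(epub_path) as z:
        return z.read(name).decode()

def readall(self, namelist):
    # self.path is assumed to hold the path that was passed to Epub(...)
    names = [i for i in namelist
             if i.startswith('OEBPS/') and i.endswith('.xhtml')]
    with Pool() as pool:
        return pool.map(read_member, [(self.path, n) for n in names])

Whether this beats the sequential readall depends on how much CPU the zip decompression costs compared with the overhead of reopening the archive in every task.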
Antoine
  • Thanks, it works, but I still have no idea how to adapt my full code because of the difference between a plain `file` and a `ZipFile`. Any suggestions? – PaleNeutron Oct 10 '14 at 08:02
  • @PaleNeutron: Unrelated: don't use `_io.TextIOWrapper`, use `io.open()` instead if you need it. – jfs Oct 10 '14 at 19:58