
I am writing a large amount of data to a subprocess's stdin.

How do I ensure that the write does not block?

p = subprocess.Popen([path], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
p.stdin.write('A very very very large amount of data')
p.stdin.flush()
output = p.stdout.readline()

It seems to hang at `p.stdin.write()` after I read in a large string and write it to the pipe.

I have a large corpus of files (>1,000) that will be written to stdin sequentially.

So what happens is that I am running a loop:

# this loop is repeated for all the files
for stri in lines:
    p = subprocess.Popen([path], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
    p.stdin.write(stri)
    output = p.stdout.readline()
    # do some processing

It somehow hangs at file no. 400, which is a large file with long strings.

I suspect it's a blocking issue.

This only happens if I iterate from 0 to 1000. However, if I start from file 400, the error does not happen.
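For what it's worth, the hang is consistent with the child's pipe buffer filling up: an OS pipe has a fixed capacity, and once it is full, further writes block until the other end reads. A Unix-only sketch (assuming Python 3.5+ for `os.set_blocking`) that measures the capacity of a raw pipe:

```python
import os

# A pipe with nobody reading the other end: keep writing until it is full.
r, w = os.pipe()
os.set_blocking(w, False)   # make writes raise BlockingIOError instead of hanging
total = 0
try:
    while True:
        total += os.write(w, b'A' * 4096)
except BlockingIOError:
    pass                    # pipe is full; a *blocking* write would hang right here
print('pipe capacity: %d bytes' % total)   # typically 65536 on Linux
os.close(r)
os.close(w)
```

A blocking `p.stdin.write()` of a string larger than this capacity can only complete if the child keeps draining its stdin, and the child in turn stalls if nobody drains its stdout.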

aceminer
  • Looks like you need to [`select`](https://docs.python.org/2/library/select.html#select.select). – Dan D. Sep 01 '15 at 01:02
  • 2
    Do you want to avoid it blocking at all, or just worried about a deadlock when the process's output fills up the `stdout` pipe before you finish writing to the `stdin` pipe? I think `p.communicate` will fix the deadlock, but it still will block until all the input has been sent (it just uses threads to buffer in memory whatever is coming back at the same time). – Blckknght Sep 01 '15 at 01:13
  • @DanD. Hi, could you explain further? I don't quite get the documentation. – aceminer Sep 01 '15 at 01:14
  • @Blckknght Maybe I should explain my situation further. – aceminer Sep 01 '15 at 01:14
  • 1
    What does the program you're running print out its stdout? You seem to be reading a single line back for each line you write, but could the program be printing more than that? Similarly, does it read your full input line before starting to write its response line, or does it work on shorter bits of data (e.g. byte by byte)? – Blckknght Sep 01 '15 at 01:22
  • Yes, I am writing a single string to an engine to be processed and waiting for the output before moving on to the next line. It has to read the full string before any processing can be done. – aceminer Sep 01 '15 at 01:27
  • 1
    @aceminer: I understand, my question was about the program on the other end. Does it sometimes return two or more lines for a single input line, or is it guaranteed to only ever return one for one? Similarly, does it operate on the input as a byte stream, or does it buffer the whole line you're sending before starting to respond? If the former of either of these, you'll probably need to use threading or some similar situation (maybe `p.communicate` with a timeout if you're using Python 3.3+) on your end to make sure you're reading the response at the same time you're writing. – Blckknght Sep 01 '15 at 01:33
  • @Blckknght The program on the other end also returns only a single string at any one time. The string from my Python program will be encoded in UTF-8, and likewise the return string from the program will also be UTF-8. I am using Python 2.7.9. – aceminer Sep 01 '15 at 01:35
  • @aceminer I've added a threaded solution from which you can start out if you haven't yet done something like this before. The threaded hack is easier to implement than the one that uses `select` and works on windows too. – pasztorpisti Sep 01 '15 at 03:03

2 Answers


To avoid the deadlock in a portable way, write to the child in a separate thread:

#!/usr/bin/env python
from subprocess import Popen, PIPE
from threading import Thread

def pump_input(pipe, lines):
    with pipe:
        for line in lines:
            pipe.write(line)

p = Popen(path, stdin=PIPE, stdout=PIPE, bufsize=1)
Thread(target=pump_input, args=[p.stdin, lines]).start()
with p.stdout:
    for line in iter(p.stdout.readline, b''): # read output
        print line,
p.wait()

See Python: read streaming input from subprocess.communicate()
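For a self-contained illustration of this pattern, here is the same pump with `cat` standing in for the child at `path` (an assumption — the real program isn't shown), using bytes literals so it also runs on Python 3:

```python
from subprocess import Popen, PIPE
from threading import Thread

def pump_input(pipe, lines):
    # writer thread: feed the child, then close its stdin so it sees EOF
    with pipe:
        for line in lines:
            pipe.write(line)

lines = [b'first\n', b'second\n', b'third\n']
p = Popen(['cat'], stdin=PIPE, stdout=PIPE)   # 'cat' echoes stdin back to stdout
Thread(target=pump_input, args=[p.stdin, lines]).start()
results = []
with p.stdout:
    for line in iter(p.stdout.readline, b''):  # read until EOF
        results.append(line)
p.wait()
```

Because the writing happens on its own thread, the main thread is always free to drain the child's stdout, so neither pipe can fill up and deadlock the pair.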

jfs

You may have to use Popen.communicate().

If you write a large amount of data to stdin and the child process meanwhile generates output on stdout, the child's stdout pipe buffer can fill up before you have finished writing your stdin data. At that point the child blocks on a write to stdout (because you are not reading it) while you are blocked writing to its stdin: a deadlock.

Popen.communicate() writes stdin and reads stdout/stderr at the same time, which avoids this problem.

Note: Popen.communicate() is suitable only when the input and output data fit in memory (they are not too large).
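A minimal sketch of the communicate() approach (with `cat` standing in for the real child process — an assumption), pushing more data through than any pipe buffer could hold:

```python
import subprocess

data = b'A' * 1000000   # far larger than a typical 64 KiB pipe buffer
p = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# communicate() feeds stdin and drains stdout concurrently,
# so the child can never block on a full stdout pipe
out, _ = p.communicate(data)
```

A plain `p.stdin.write(data)` followed by a read would deadlock here, because `cat` echoes everything back and fills its stdout pipe long before the write finishes.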

Update: If you decide to hack around with threads, here is an example parent and child process implementation that you can tailor to suit your needs:

parent.py:

#!/usr/bin/env python2
import os
import sys
import subprocess
import threading
import Queue


class MyStreamingSubprocess(object):
    def __init__(self, *argv):
        self.process = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self.stdin_queue = Queue.Queue()
        self.stdout_queue = Queue.Queue()
        self.stdin_thread = threading.Thread(target=self._stdin_writer_thread)
        self.stdout_thread = threading.Thread(target=self._stdout_reader_thread)
        self.stdin_thread.start()
        self.stdout_thread.start()

    def process_item(self, item):
        self.stdin_queue.put(item)
        return self.stdout_queue.get()

    def terminate(self):
        self.stdin_queue.put(None)
        self.process.terminate()
        self.stdin_thread.join()
        self.stdout_thread.join()
        return self.process.wait()

    def _stdin_writer_thread(self):
        while 1:
            item = self.stdin_queue.get()
            if item is None:
                # signaling the child process that the end of the
                # input has been reached: some console progs handle
                # the case when reading from stdin returns empty string
                self.process.stdin.close()
                break
            try:
                self.process.stdin.write(item)
            except IOError:
                # making sure that the current self.process_item()
                # call doesn't deadlock
                self.stdout_queue.put(None)
                break

    def _stdout_reader_thread(self):
        while 1:
            try:
                output = self.process.stdout.readline()
            except IOError:
                output = None
            self.stdout_queue.put(output)
            # output is empty string if the process has
            # finished or None if an IOError occurred
            if not output:
                break


if __name__ == '__main__':
    child_script_path = os.path.join(os.path.dirname(__file__), 'child.py')
    process = MyStreamingSubprocess(sys.executable, '-u', child_script_path)
    try:
        while 1:
            item = raw_input('Enter an item to process (leave empty and press ENTER to exit): ')
            if not item:
                break
            result = process.process_item(item + '\n')
            if result:
                print('Result: ' + result)
            else:
                print('Error processing item! Exiting.')
                break
    finally:
        print('Terminating child process...')
        process.terminate()
        print('Finished.')

child.py:

#!/usr/bin/env python2
import sys

while 1:
    item = sys.stdin.readline()
    if not item:
        # empty string means EOF: the parent closed our stdin
        break
    sys.stdout.write('Processed: ' + item)

Note: IOError is handled on the reader/writer threads to cover the cases where the child process exits, crashes, or is killed.

pasztorpisti
  • @pasztorpisti Yes, that is what I suspect the issue is. How do I resolve this? Popen.communicate() will not work as I am writing repeatedly. Hence, I cannot afford to open and close the process every time I need to process a string. – aceminer Sep 01 '15 at 01:17
  • @aceminer If you are streaming input/output on the fly then you need `select`, as Blckknght recommended. However, `select` has portability issues (it is of limited use on Windows). For this reason, long ago I wrote a multithreaded hack (I wasn't proud of it) on Windows to overcome this issue (reading stdout and writing stdin on separate threads). Actually, the Python `select` doc says that `select` works only with sockets on Windows... – pasztorpisti Sep 01 '15 at 01:21
  • @aceminer also make sure that the stdout buffering of the child process is turned off. Otherwise you might end up in a situation where you are blocked on reading till the end of the line and the child process has the end of the line buffered. – pasztorpisti Sep 01 '15 at 01:26
  • Are you talking about turning off the stdout buffering on my side? Sorry, I am very new to this. – aceminer Sep 01 '15 at 01:28
  • @aceminer I referred to the implementation of the program you launch as a child process. You may hang if the child process is buffering its output while you are waiting for a whole line to come out of its stdout. This bug happened to me only once, where I captured the output of a C++ app that was implemented in a slightly weird way. You don't have to worry about this if your app doesn't mess with stdout and uses it the standard, default way. – pasztorpisti Sep 01 '15 at 01:34
  • @pasztorpisti Yes, I am using it the standard way. – aceminer Sep 01 '15 at 01:36
  • If you are launching another Python application (for example with `[sys.executable, my_python_script_path]`), make sure you are passing the `-u` parameter to the interpreter: `[sys.executable, '-u', my_python_script_path]`. I remembered that there is a trick that is needed, and I've just tried it on Windows with Python 2.7.9: `-u` was necessary. – pasztorpisti Sep 01 '15 at 01:48
  • @aceminer The source to the `subprocess` module should be available, you could try to retrieve the implementation of `communicate` and modify it according to your needs. – skyking Sep 01 '15 at 05:38