Just in case it might help someone:
I was struggling with the same problem (same https://github.com/tylin/coco-caption code). Might be relevant to say that I was running the code with python 3.7 on CentOS using qsub
. So I changed
cmd = ['java', '-cp', 'stanford-corenlp-3.4.1.jar', 'edu.stanford.nlp.process.PTBTokenizer', '-preserveLines', '-lowerCase', 'tmpWS5p0Z']
to
cmd = ['/abs/path/to/java -cp /abs/path/to/stanford-corenlp-3.4.1.jar edu.stanford.nlp.process.PTBTokenizer -preserveLines -lowerCase ', ' /abs/path/to/temporary_file']
Using absolute paths fixed the OSError: [Errno 2] No such file or directory
. Notice that I still put '/abs/path/to/temporary_file'
as second element in the cmd
list, because it got added later on. But then something went wrong in the tokenizer java subprocess, I don't know why or what, just observing because:
p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, stdout=subprocess.PIPE, shell=True)
token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0]
Here token_lines
was an empty list (which is not the wanted behavior). Executing this in IPython resulted in the following (just the subprocess.Popen(...
, not the communicate
).
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Input/output error
at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:278)
at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:163)
at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:55)
at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:444)
at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:416)
at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:760)
Caused by: java.io.IOException: Input/output error
at java.base/java.io.FileInputStream.readBytes(Native Method)
at java.base/java.io.FileInputStream.read(FileInputStream.java:279)
at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:290)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
at edu.stanford.nlp.process.PTBLexer.zzRefill(PTBLexer.java:24511)
at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:24718)
at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:276)
... 5 more
Again, I don't know why or what, but I just wanted to share that doing this fixed it:
cmd = ['/abs/path/to/java -cp /abs/path/to/stanford-corenlp-3.4.1.jar edu.stanford.nlp.process.PTBTokenizer -preserveLines -lowerCase /abs/path/to/temporary_file']
And changing cmd.append(os.path.join(path_to_jar_dirname, os.path.basename(tmp_file.name)))
into cmd[0] += os.path.join(path_to_jar_dirname, os.path.basename(tmp_file.name))
.
So making cmd
into a list with only 1 element, containing the entire command with absolute paths at once. Thanks for your help!