
I want to pass a regex to pdfgrep using Python's subprocess module. The code executes without raising an exception, but pdfgrep is not receiving the argument properly. A test PDF is in the current working directory and contains the string 'Mary Jane'. Here's my code (Python 3.6):

import subprocess
filtered = ['[A-Z].+Jane'] # the list of regexes is shortened to one string, to keep the example simple.
for regex in filtered: 
    arg = 'pdfgrep -PrH ' + f"{regex}"
    process_match = subprocess.run(arg, stdout=subprocess.PIPE, shell=True)

The expected result is that process_match would contain a CompletedProcess() object containing the match.

But instead, it returns the following:

CompletedProcess(args="pdfgrep -PrH '[A-Z].+Jane'", returncode=127, stdout=b'')

At the command line, invoking the same pdfgrep command finds the matching pdf. And I can do the task fairly trivially in Ruby with code like the following:

process_match = %x[pdfgrep -PrH "#{regex}"]

I'm new to python. What am I getting wrong when trying to pass the regex to the external command?

drewj
  • I'm not sure if it's a syntax thing I'm missing, but did you omit a space in the first code example, after `-PrH` and/or before the `{regex}`? Also, does this occur with other programs besides `pdfgrep`? (E.g. can you do it with `/bin/echo`, assuming you're on a UNIX-like system on which that program exists?) – David Z Feb 04 '17 at 05:06
  • @DavidZ You were right about a missing space: it should be `'pdfgrep -PrH '`, but this still doesn't lead to the expected result. Using `'/bin/echo '` does yield the expected behavior. Perhaps it's something to do with pdfgrep itself? – drewj Feb 04 '17 at 15:14
  • Your comment revealed to me that I should have specified the full path to pdfgrep. See my answer to my own question. – drewj Feb 04 '17 at 15:47

2 Answers


When the shell is not used, subprocess.run should be given the command as a list of arguments (not a single string), e.g.

arg = ['pdfgrep',  '-PrH',  f"{regex}"]

instead of arg = 'pdfgrep -PrH' + f"{regex}"

Edit:

Your comment that you should use a string when using shell=True is correct. However, as discussed in the Python subprocess documentation, shell=True can have security implications and is seldom strictly necessary, so it's probably best to develop the habit of not using the shell.
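To illustrate the list form, here is a minimal sketch of how an argument reaches the child process untouched when no shell is involved. The helper name `run_with_args` and the stand-in child (the current Python interpreter echoing its first argument) are assumptions made for the demo, so that it runs even where pdfgrep is not installed.

```python
import subprocess
import sys


def run_with_args(args):
    """Run a command given as a list, without a shell, and return its stdout as text."""
    completed = subprocess.run(args, stdout=subprocess.PIPE)
    return completed.stdout.decode()


regex = '[A-Z].+Jane'

# Stand-in child process (an assumption for this demo): it prints back
# the first argument it received, so we can see what actually arrived.
echo_args = [sys.executable, '-c', 'import sys; print(sys.argv[1])', regex]
received = run_with_args(echo_args).strip()
print(received)

# The pdfgrep call from the question would use the same shape:
# run_with_args(['pdfgrep', '-PrH', regex])
```

Because each list element is passed as its own argument, the brackets and other special characters in the regex need no quoting or escaping.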

Patrick Maupin
  • I was aware of passing the arguments as a list rather than a string. But I thought that with `shell=True` I needed a string. In any case, trying code like this `for regex in filtered: arg = ['pdfgrep', '-PrH', f"{regex}"] process_match = subprocess.run(arg, stdout=subprocess.PIPE, shell=True)` still gives me this value in process_match `CompletedProcess(args=['pdfgrep', '-PrH', '[A-Z].+Jane'], returncode=127, stdout=b'')` So pdfgrep still doesn't seem to be getting the regex correctly. – drewj Feb 04 '17 at 14:55
  • You were right. But I also needed to specify the full path to pdfgrep. I've incorporated your advice into my own answer. Thanks. – drewj Feb 04 '17 at 15:52

The following code works as expected:

for regex in filtered:
    arg = ['/usr/local/bin/pdfgrep',  '-PrH',  f"{regex}"]
    process_match = subprocess.run(arg, stdout=subprocess.PIPE)

My original code had (at least) two problems. First, I needed to pass the command to subprocess.run as a list, without shell=True. Second, for that to work, I needed to specify the full path to pdfgrep.
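A return code of 127 is what a POSIX shell reports when it cannot find the command, which fits the full-path fix. Rather than hard-coding the path, one option is to look the executable up with `shutil.which` first. The sketch below is only an illustration: the helper names `find_executable` and `search_pdfs`, and the choice of `/usr/local/bin` as an extra directory to try, are assumptions based on the path used above.

```python
import shutil
import subprocess


def find_executable(name, extra_dirs=()):
    """Return the full path to `name`, or None if it cannot be found.

    Searches the PATH seen by this Python process first, then each
    directory in `extra_dirs` (hypothetical fallback locations).
    """
    path = shutil.which(name)
    if path is None:
        for directory in extra_dirs:
            path = shutil.which(name, path=directory)
            if path is not None:
                break
    return path


def search_pdfs(regex):
    """Run pdfgrep on the current directory; return its stdout, or None if pdfgrep is missing."""
    pdfgrep = find_executable('pdfgrep', extra_dirs=['/usr/local/bin'])
    if pdfgrep is None:
        return None
    result = subprocess.run([pdfgrep, '-PrH', regex], stdout=subprocess.PIPE)
    return result.stdout.decode()


if __name__ == '__main__':
    output = search_pdfs('[A-Z].+Jane')
    print('pdfgrep not found' if output is None else output)
```

If `find_executable` returns None, the problem is the lookup rather than the regex, which is easier to diagnose than an empty stdout.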

drewj
  • This is confusing to me (and probably why your question didn't draw an immediate answer), because /usr/local/bin is usually in the PATH environment variable, and most implementations of Python will search the path passed to the Python interpreter. Do you have a restricted path for some reason? – Patrick Maupin Feb 04 '17 at 16:17
  • @Patrick I would have thought that, but I've found that it's not actually universal that `/usr/local/bin` is in the path. Quite a few configurations do omit it by default. – David Z Feb 04 '17 at 22:48
  • @DavidZ -- sure, but OP claims it works otherwise, e.g. manually or with Ruby. So is the Python interpreter somehow set up to not use the path, or is the Ruby interpreter doing some magic? – Patrick Maupin Feb 05 '17 at 05:26
  • @Patrick There's really not enough information to tell. Python does use the path inherited from the parent process, sort of; see [this question](https://stackoverflow.com/questions/5658622/python-subprocess-popen-environment-path) for more details. I don't know anything about how Ruby does it. – David Z Feb 05 '17 at 06:39