
I am trying to write a bash script to merge all PDF files of a directory into one single PDF file. The command `pdfunite *.pdf output.pdf` successfully achieves this, but it merges the input documents in lexicographic order:

1.pdf
10.pdf
11.pdf
2.pdf
3.pdf
4.pdf
5.pdf
6.pdf
7.pdf
8.pdf
9.pdf

while I'd like the documents to be merged in numerical order:

1.pdf
2.pdf
3.pdf
4.pdf
5.pdf
6.pdf
7.pdf
8.pdf
9.pdf
10.pdf
11.pdf

I guess a command mixing `ls -v` or `sort -n` and `pdfunite` would do the trick, but I don't know how to combine them. Any idea how I could merge PDF files with a numerical sort?


4 Answers


You can embed the result of a command using `$()`, so you can do the following:

$ pdfunite $(ls -v *.pdf) output.pdf

or

$ pdfunite $(ls *.pdf | sort -n) output.pdf

However, note that this does not work when a filename contains special characters such as whitespace.

In that case you can do the following:

ls -v *.pdf | bash -c 'IFS=$'"'"'\n'"'"' read -d "" -ra x; pdfunite "${x[@]}" output.pdf'

Although it seems a little bit complicated, it's just a combination of reading the newline-separated filenames into a bash array and passing that array to pdfunite.

Note that you cannot use plain `xargs`, since pdfunite requires the input PDFs in the middle of its argument list. I avoided `readarray` since it is not supported in older bash versions, but you can use it instead of `IFS=.. read -ra ..` if you have a newer bash.
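For newer bash (4+), that `readarray`/`mapfile` route can be sketched like this. The dummy files and the `printf` inspection line are only for demonstration; the real merge is the commented-out `pdfunite` call:

```shell
#!/usr/bin/env bash
# Sketch for bash 4+: mapfile replaces the IFS=.. read -ra .. idiom.
# Dummy files stand in for real PDFs so the ordering is easy to check.
set -euo pipefail
dir=$(mktemp -d)
cd "$dir"
touch {1..11}.pdf                     # dummy input files
mapfile -t files < <(ls -v -- *.pdf)  # natural-sort names into an array
printf '%s\n' "${files[@]}"           # 1.pdf 2.pdf ... 10.pdf 11.pdf
# pdfunite "${files[@]}" output.pdf   # the real merge step
```

Like the `read -ra` version, this keeps each filename intact as one array element, so spaces in names survive.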

  • Thank you so much! I confirm solutions 1 & 2 work, but I couldn't get solution 3 (xargs) to work. I think pdfunite is not recognizing the inputs. Could you explain your solution 3 in detail? – max May 14 '14 at 01:43
  • Sorry, `xargs -I {}` can only apply arguments one by one. Forget it, I will write a new answer. – ymonad May 14 '14 at 01:58
  • Yes, that answer is actually a little bit wrong (the second `sh` string is garbage), so I updated the answer and wrote the correct one. – ymonad May 14 '14 at 02:25
  • Sorry again, my mistake: I forgot to sort, and `find` cannot sort. I will update the answer again. – ymonad May 14 '14 at 02:32
  • Finally seems to have a working answer. Hope this will be the final version. – ymonad May 14 '14 at 03:16
  • Thank you very much for your time, ymonad! I would never have figured it out by myself. Just one thing about your final command `ls -v *.txt | bash -c 'IFS=$'"'"'\n'"'"' read -d "" -ra x;pdfunite "${x[@]}" output.pdf'`: you need to replace `*.txt` with `*.pdf`. – max May 14 '14 at 10:19
  • @ymonad Can this case be extended to numerical order and modified-time order? Thread here: http://unix.stackexchange.com/q/332348/16920 – Léo Léopold Hertz 준영 Dec 23 '16 at 09:36
  • Try `sort -V` instead. – Kevin Dong Feb 03 '17 at 08:38
  • Note that if you're trying to merge many files (I was attempting to merge > 2k files), you may get bash "argument list too long" errors. I posted a solution [here](https://stackoverflow.com/a/44528538/426790) that uses Python to iterate over a directory's PDFs files in "natural order" and merge them into one file. – Greg Sadetsky Jun 13 '17 at 18:07

Do it in multiple steps. I am assuming you have files from 1 to 99.

 pdfunite $(find ./ -regex ".*[^0-9][0-9][^0-9].*"  | sort) out1.pdf
 pdfunite out1.pdf $(find ./ -regex ".*[^0-9]1[0-9][^0-9].*"  | sort) out2.pdf
 pdfunite out2.pdf $(find ./ -regex ".*[^0-9]2[0-9][^0-9].*"  | sort) out3.pdf

and so on.

The final file will contain all your PDFs in numerical order.

!!! Beware of the output file names such as out1.pdf, out2.pdf, etc.; give each step a distinct name, otherwise pdfunite will overwrite the last file !!!

Edit: Sorry, I was missing the `[^0-9]` in each regex. Corrected it in the above commands.
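As a quick sanity check of what the first regex selects (the file names here are dummies I created for the demonstration), `[^0-9][0-9][^0-9]` matches exactly one digit surrounded by non-digits, so only the single-digit files are picked up:

```shell
#!/usr/bin/env bash
# Sketch: show which dummy files the first regex selects.
# [^0-9][0-9][^0-9] matches exactly one digit with non-digits around it,
# so ./1.pdf .. ./9.pdf match but ./10.pdf and ./11.pdf do not.
set -euo pipefail
dir=$(mktemp -d)
cd "$dir"
touch {1..25}.pdf                     # dummy input files
matched=$(find ./ -regex ".*[^0-9][0-9][^0-9].*" | sort)
echo "$matched"
```

The second and third commands then pick up the 10–19 and 20–29 ranges the same way, which is why each step has to be run in turn.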

  • thanks for the tip but it doesn't sort correctly. If you merge `1.pdf, 2.pdf, 11.pdf`, the order will be `11.pdf, 1.pdf, 2.pdf`. Changing `sort` to `sort -n` doesn't fix the problem – max Feb 20 '16 at 11:48
  • thanks, and I corrected the answer. Also, I want to note that the command above is not generic, but covers most human-generated files. – infoclogged Feb 21 '16 at 14:01
  • thanks for this correction but it doesn't work yet. `pdfunite $(find ./ -regex ".*[^0-9][0-9][^0-9].*" | sort) out1.pdf` generates `out1.pdf` that includes 1.pdf and 2.pdf only (not 11.pdf) – max Feb 22 '16 at 16:40
  • you have to follow the other commands as well. Run the next line and you will get 11.pdf. If you look carefully, in the second line the input is out1.pdf and the output is out2.pdf. Even the regex is slightly different. – infoclogged Feb 22 '16 at 17:24

You can rename your documents, e.g. 001.pdf, 002.pdf, and so on.
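A minimal sketch of that renaming idea (the `%03d` padding width and the dummy file names are my assumptions; adjust the width to your largest number). Once names are zero-padded, lexicographic order equals numeric order, so the plain `pdfunite *.pdf output.pdf` from the question just works:

```shell
#!/usr/bin/env bash
# Sketch: zero-pad numeric names so lexicographic order equals
# numeric order; plain pdfunite *.pdf then merges correctly.
set -euo pipefail
dir=$(mktemp -d)
cd "$dir"
touch {1..11}.pdf                      # dummy input files
for f in *.pdf; do
  n=${f%.pdf}                          # strip the extension
  mv -- "$f" "$(printf '%03d.pdf' "$n")"
done
ls                                     # 001.pdf 002.pdf ... 011.pdf
```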

destfile=combined.pdf
find . -maxdepth 1 -type f -name '*.pdf' -print0 \
   | sort -z -t '/' -k2n \
   | { cat; printf '%s\0' "$destfile"; } \
   | xargs -0 -x pdfunite
  1. Variable destfile holds the name of the destination pdf file.
  2. The find command finds all the pdf files in the current directory and outputs them as a NUL delimited list.
  3. The sort command reads the NUL delimited list of filenames. It specifies a field delimiter of /. It sorts by the 2nd field numerically. (Recall that the output of find looks like ./11.pdf ....)
  4. We append destfile before sending to xargs, being sure to end it with a NUL.
  5. xargs reads the NUL delimited args and supplies them to the pdfunite command. We supplied the -x option so that xargs will exit if the command length is too long. We don't want xargs to execute a partially constructed command.

This solution handles filenames with embedded newlines and spaces.
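To inspect the pipeline's ordering without pdfunite installed, one can substitute `printf` as a stand-in command (the dummy files and the `printf` swap are mine; everything else is the pipeline above, verbatim):

```shell
#!/usr/bin/env bash
# Sketch: dry-run of the NUL-delimited pipeline with printf standing
# in for pdfunite, so the final argument order can be inspected.
set -euo pipefail
dir=$(mktemp -d)
cd "$dir"
touch {1..11}.pdf                      # dummy input files
destfile=combined.pdf
result=$(find . -maxdepth 1 -type f -name '*.pdf' -print0 \
   | sort -z -t '/' -k2n \
   | { cat; printf '%s\0' "$destfile"; } \
   | xargs -0 -x printf '%s\n')
echo "$result"
```

The last line printed is `combined.pdf`, confirming the destination lands at the end of the argument list, exactly where pdfunite expects it.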
