
What I want to do is to open links like this one:

http://link.springer.com/openurl?genre=book&isbn=978-0-306-48048-5

That link leads to a book's page. To download the book, you have to press the Download book PDF button, which opens the PDF in the browser; from there you can finally save it.

As you can see, there are several steps:

  1. get the titles and links from an Excel file
  2. open each link, then extract the link behind the button in order to get to the PDF version of the book
  3. save the book's PDF to the computer.

I found tutorials for steps 1 and 3, so I think I will manage there, but nothing for step number 2. Can you help me? I accept suggestions for the other steps too.

I want to do this in Java; the second option would be C/C++, and the third Python. I don't need full, complete code, just libraries, ideas, and examples of code.

Thank you!

Darius
  • I want to mention that I want to run the script and then have the books downloaded; I don't want the program to open any browser windows, just to access the pages to get the information. – Darius Apr 29 '20 at 22:06

1 Answer


You don't really need to open this book in a browser. The only thing you need is the link assigned to that button. Since it is not added by JavaScript, it can simply be extracted from the source code of the website.

What you really need is an answer from here: How to send HTTP request in java?

So the steps would be:

  1. Download the website at the URL you have.
  2. Find the URL of the book's PDF in the downloaded page.
  3. Download the file from that URL.
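Step 1 can be sketched in Java with the standard `URLConnection` API. This is only an illustrative sketch, not the answer linked above; the browser-like `User-Agent` header is a precaution I added, since some servers reject the default Java one:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class FetchPage {
    // Downloads the source code of a web page into a String.
    static String fetchPage(String urlStr) throws Exception {
        URLConnection conn = new URL(urlStr).openConnection();
        // Some servers reject the default Java user agent, so set a browser-like one.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws Exception {
        // e.g. java FetchPage "http://link.springer.com/openurl?genre=book&isbn=978-0-306-48048-5"
        if (args.length > 0) {
            System.out.println(fetchPage(args[0]));
        }
    }
}
```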

To extract the link from the downloaded website you can use regexes. There is already a post about that.
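In Java, that extraction (step 2) could look like this, using the `java.util.regex` package. This is a sketch: the pattern assumes Springer's `/content/pdf/...` hrefs as they appear in this page's markup, and may need adjusting for other pages:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractPdfLink {
    // Finds the first PDF path (e.g. /content/pdf/10.1007%2Fb100747.pdf) in the page HTML.
    static String extractPdfPath(String html) {
        Matcher m = Pattern.compile("href=\"(/content/pdf/[^\"]+\\.pdf)\"").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A fragment of the Springer page's markup, for demonstration.
        String html = "<a href=\"/content/pdf/10.1007%2Fb100747.pdf\" target=\"_blank\" "
                + "data-track-action=\"Book download - pdf\">";
        System.out.println(extractPdfPath(html)); // prints /content/pdf/10.1007%2Fb100747.pdf
    }
}
```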

To download a file in Java, you can do it like this:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

class Main {
    public static void downloadFile(URL url, String outputFileName) throws IOException {
        try(InputStream in = url.openStream();
            ReadableByteChannel rbc = Channels.newChannel(in);
            FileOutputStream fos = new FileOutputStream(outputFileName))
        {
            fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. downloadFile(new URL("https://link.springer.com/content/pdf/10.1007%2Fb100747.pdf"), "book.pdf");
    }
}

I took this solution from here.

This might seem quite different from clicking a button, but it is the way it should be done, because it is much faster. If you automated these actions through a browser, you would waste resources: you would have to run a separate instance of, for example, Firefox, loading times would be longer, and you would download and render graphics and other content unrelated to what you actually want. Moreover, grabbing a file through the browser's built-in download manager can be awkward. Anyway, if you would like to learn some techniques for interacting with website elements directly, take a look at Selenium.


This example will be in Bash, but I write it just to give you an idea of how to download the website and extract values from it. You can do the same in Java or any other language.

Download a website:

wget -O website.html http://link.springer.com/openurl\?genre\=book\&isbn\=978-0-306-48048-5

I wanted to find something unique that identifies the button, and for example I took the data-track-action="Book download - pdf" attribute. I used it to search through the downloaded website:

cat website.html | grep 'data-track-action="Book download - pdf"'                          
        <a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">
        <a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">

As you can see, there are 2 lines of output. They are identical, so a dirty fix for that is the uniq command:

cat website.html | grep 'data-track-action="Book download - pdf"' | uniq
        <a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">

You could simply take only the first line as well.

Now, using a regex, the path to the PDF can be extracted:

cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'
/content/pdf/10.1007%2Fb100747.pdf

To prepend the domain, we can echo the fixed part of the link and wrap the other part in $() to evaluate the commands:

echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf') 
https://link.springer.com/content/pdf/10.1007%2Fb100747.pdf

And to use the result in wget, we can do something like this:

wget $(echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'))

So the final code would be these 2 lines:

wget -O website.html "http://link.springer.com/openurl?genre=book&isbn=978-0-306-48048-5"
wget $(echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'))