You don't really need to open this book in a browser. All you need is to copy the link assigned to the download button. Since it is not added by JavaScript, it can simply be extracted from the page's source code.
What you really need is an answer from here: How to send HTTP request in java?
So the steps would be:
- Download the website from the URL you have.
- Find the URL to the book's PDF on the downloaded page.
- Download the file using the URL you found.
To extract the link from the downloaded page you might use regexes. There is already a post about that.
To download a file in Java you can do it like this:
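For example, here is a minimal sketch of the extraction step using java.util.regex. The pattern is an assumption: it just grabs the first href attribute ending in .pdf, so adjust it to the actual markup of the page you downloaded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Finds the first href value that points to a PDF. This assumes the
    // link is a plain href="..." attribute ending in .pdf -- adjust the
    // pattern to match the real page markup.
    static String extractPdfHref(String html) {
        Pattern p = Pattern.compile("href=\"([^\"]+\\.pdf)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"/content/pdf/10.1007%2Fb100747.pdf\" target=\"_blank\">";
        System.out.println(extractPdfHref(sample));
        // prints: /content/pdf/10.1007%2Fb100747.pdf
    }
}
```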
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

class Main {
    public static void downloadFile(URL url, String outputFileName) throws IOException {
        try (InputStream in = url.openStream();
             ReadableByteChannel rbc = Channels.newChannel(in);
             FileOutputStream fos = new FileOutputStream(outputFileName)) {
            fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        }
    }

    public static void main(String[] args) throws Exception {
        // call to downloadFile() method
    }
}
I took this solution from here.
This might seem quite different from clicking a button, but it is the way it should be done, because it is much faster. If you automated these actions through a browser, you would waste far more resources: you would have to run a separate instance of, say, Firefox; page loads would take longer; and you would download and render graphics and other content unrelated to what you actually want. Moreover, grabbing a file through the browser's built-in download manager can be tricky to automate. That said, if you would like to learn techniques for interacting with page elements directly, take a look at Selenium.
This example will be in Bash, but it is only meant to give you an idea of how to download the page and extract values from it. You can do the same in Java or any other language.
Download a website:
wget -O website.html http://link.springer.com/openurl\?genre\=book\&isbn\=978-0-306-48048-5
I wanted to find something unique that identifies the button, and I settled on the data-track-action="Book download - pdf"
attribute. I used it to search through the downloaded page:
cat website.html | grep 'data-track-action="Book download - pdf"'
<a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">
<a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">
As you can see, there are 2 lines of output. They are identical, so a dirty fix for that is the uniq
command:
cat website.html | grep 'data-track-action="Book download - pdf"' | uniq
<a href="/content/pdf/10.1007%2Fb100747.pdf" target="_blank" class="c-button c-button--blue c-button__icon-right test-download-book-options test-bookpdf-link" title="Download this book in PDF format" rel="noopener" data-track="click" data-track-action="Book download - pdf" data-track-label="">
You could simply take only the first line as well.
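For instance, head -n 1 keeps only the first line. It is also a bit more robust than uniq here, since uniq only collapses adjacent duplicates:

```shell
# Two identical lines on stdin; head -n 1 passes through only the first.
printf 'duplicate line\nduplicate line\n' | head -n 1
# prints: duplicate line
```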
Now, using some regex, the path to PDF can be extracted:
cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'
/content/pdf/10.1007%2Fb100747.pdf
To prepend the domain, we can echo the fixed part of the link and wrap the extracted part in $()
so the commands inside are evaluated:
echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf')
https://link.springer.com/content/pdf/10.1007%2Fb100747.pdf
And to use the result in wget, we can do something like this:
wget $(echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'))
So the final code would be these 2 lines:
wget -O website.html "http://link.springer.com/openurl?genre=book&isbn=978-0-306-48048-5"
wget $(echo "https://link.springer.com"$(cat website.html | grep 'data-track-action="Book download - pdf"' | uniq | grep -o '\/content.*\.pdf'))
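And if you would rather stay in Java end to end, the same two steps can be sketched like this. The regex and the base URL are the same assumptions as in the Bash version above, and downloadFile() is the helper shown earlier in this answer:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BookDownloader {
    // Same stream-to-file helper as shown earlier in this answer.
    static void downloadFile(URL url, String outputFileName) throws IOException {
        try (InputStream in = url.openStream();
             ReadableByteChannel rbc = Channels.newChannel(in);
             FileOutputStream fos = new FileOutputStream(outputFileName)) {
            fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        }
    }

    // Read the whole response body into a String (the "\\A" delimiter
    // makes Scanner return the entire stream as one token).
    static String fetchPage(URL url) throws IOException {
        try (Scanner sc = new Scanner(url.openStream(), StandardCharsets.UTF_8.name())) {
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }

    // Mirrors the grep -o '/content.*\.pdf' step from the Bash version.
    static String extractPdfPath(String html) {
        Matcher m = Pattern.compile("href=\"(/content[^\"]*\\.pdf)\"").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        String html = fetchPage(new URL(
                "http://link.springer.com/openurl?genre=book&isbn=978-0-306-48048-5"));
        String path = extractPdfPath(html);
        if (path != null) {
            downloadFile(new URL("https://link.springer.com" + path), "book.pdf");
        }
    }
}
```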