
I am trying to get the base URL of a page using Java. I use the JTidy parser in my code to fetch the title, and that works properly, but I cannot work out how to get the base URL from a given URL.

I have some URLs as input:

String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";

From the first string, I want to get "http://staff.unak.is/andy/GameProgramming0910/" as the base URL, and from the second string, I want "http://www.complex.com/" as the base URL.

I am using this code:

URL url = new URL(s1);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
InputStream in = conn.getInputStream();
Document doc = new Tidy().parseDOM(in, null);
String titleText = doc.getElementsByTagName("title").item(0).getFirstChild()
.getNodeValue();

I am getting titleText, but can someone please let me know how to get the base URL from the URLs given above?

Matthew Murdoch
DJ31
  • What rules would tell you that `http://www.complex.com/` is the base URL and not `http://www.complex.com/pop-culture/2011/04/`? – Joachim Sauer May 16 '11 at 06:28

2 Answers


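Try the java.net.URL class; it will help you.

For the second case, which is the easier one, you could use new URL(s2).getHost(). This returns only the host name (www.complex.com), so prepend the protocol and "://" and append "/" to get the base URL.

For the first case, you could take the host together with the result of the getFile() method, and remove everything after the last slash ("/"). Something like this (code not tested):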

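URL url = new URL(s1);
// Keep everything up to and including the last '/', so the base ends with a slash
String path = url.getFile().substring(0, url.getFile().lastIndexOf('/') + 1);
// getProtocol() returns only the scheme name ("http"), so "://" is added by hand
String base = url.getProtocol() + "://" + url.getHost() + path;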
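
The approach above can be sketched as a small runnable class. This is only an illustration under stated assumptions: the class name BaseUrlSketch and the helpers directoryBase, rootBase, and parse are hypothetical names, not part of any library. It uses getAuthority() so that a non-default port is kept, and getPath() so that slashes inside a query string are ignored.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; class and method names are illustrative only.
class BaseUrlSketch {

    // Returns scheme + authority + the path up to and including its last '/'.
    static String directoryBase(String spec) {
        URL url = parse(spec);
        String path = url.getPath(); // path only, without the query string
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        if (dir.isEmpty()) {
            dir = "/"; // a URL such as "http://host" has an empty path
        }
        return url.getProtocol() + "://" + url.getAuthority() + dir;
    }

    // Returns scheme + authority + "/", i.e. the root of the site.
    static String rootBase(String spec) {
        URL url = parse(spec);
        return url.getProtocol() + "://" + url.getAuthority() + "/";
    }

    // Wraps the checked MalformedURLException (thrown, for example,
    // when the string has no protocol) in an unchecked exception.
    private static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
        String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";
        System.out.println(directoryBase(s1));
        System.out.println(rootBase(s2));
    }
}
```

Which of the two helpers is the "right" base URL for a given page is a decision the caller still has to make; the URL string alone does not say.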
mheinzerling
Pih
  • I voted up, but it seems to me the third statement should be: String base = url.getProtocol() + "://" + url.getHost() + path; – Giorgio Barchiesi Jun 03 '12 at 17:54
  • I ***THINK*** that URL getProtocol() returns the "://", but I haven't tested :( – Pih Jun 13 '12 at 12:33
  • @Pih at least in Java 6, it doesn't. You must add it. Note that "://" is not part of the protocol name. – PhoneixS Oct 14 '13 at 10:33
  • The URL string needs to be checked for a protocol first; otherwise a MalformedURLException is thrown. – chin87 May 03 '16 at 05:43
  • Looks like in the event of the port being different from the default, it's better to use url.getAuthority() rather than getHost(). info: https://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html – wtk Feb 14 '17 at 12:56
  • Also better to use getPath() instead of getFile(). getFile() also returns the query part, and that could contain many slashes ... – ChangeRequest Sep 27 '17 at 15:17

You can use the java.net.URL class to resolve relative URLs.

For the first case, removing the filename from the path:

new URL(new URL(s1), ".").toString()

For the second case, setting the root path:

new URL(new URL(s2), "/").toString()
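
The same relative-resolution idea can be sketched as a runnable class. Note the swap: this sketch uses java.net.URI and its resolve() method instead of the two-argument java.net.URL constructor shown above, mainly because URI.create() throws only an unchecked exception. The class name RelativeResolveSketch and the helpers parentDirectory and siteRoot are hypothetical names chosen for illustration.

```java
import java.net.URI;

// Hypothetical sketch; class and method names are illustrative only.
class RelativeResolveSketch {

    // Resolving "." against a URL drops the last path segment (the file name).
    static String parentDirectory(String spec) {
        return URI.create(spec).resolve(".").toString();
    }

    // Resolving "/" against a URL replaces the whole path with the root.
    static String siteRoot(String spec) {
        return URI.create(spec).resolve("/").toString();
    }

    public static void main(String[] args) {
        String s1 = "http://staff.unak.is/andy/GameProgramming0910/new_page_2.htm";
        String s2 = "http://www.complex.com/pop-culture/2011/04/10-hottest-women-in-fast-and-furious-movies";
        System.out.println(parentDirectory(s1));
        System.out.println(siteRoot(s2));
    }
}
```

Because the relative reference carries no query or fragment, the resolved result has none either, so a query string on the input URL does not leak into the base.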
Ernesto