5

I am trying to download all the files from this directory. However, I can only get it to download the URL as one file. What can I do? I searched for this problem, but the results were confusing and people were suggesting using HTTP clients instead. Thanks for your help; this is my code so far. It has been suggested that I use an input stream to obtain all the files in the directory. Would that then go into an array? I tried the tutorial here http://docs.oracle.com/javase/tutorial/networking/urls/ but it didn't help me understand.

//ProgressBar/Install
            String URL_LOCATION = "http://www.futureretrogaming.tk/gamefiles/ProfessorPhys/";
            String LOCAL_FILE = filelocation.getText() + "\\ProfessorPhys\\";
            try {
                java.net.URL url = new URL(URL_LOCATION);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection(); 
                connection.addRequestProperty("User-Agent", "Mozilla/4.76"); 
                //URLConnection connection = url.openConnection();
                BufferedInputStream stream = new BufferedInputStream(connection.getInputStream());
                File file = new File(LOCAL_FILE);
                OutputStream out = new FileOutputStream(file);
                // available() only reports what is currently buffered, so copy
                // in a loop until EOF rather than reading once into one array
                byte[] buffer = new byte[4096];
                int bytesRead;
                while ((bytesRead = stream.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
                out.close();
                stream.close();
            } catch (Exception e) {
                System.err.println(e);
            }

I also found this code, which will return a list of files to download. Can someone help me combine the two?

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;

public class GetAllFilesInDirectory {

    public static void main(String[] args) throws IOException {

        File dir = new File("dir");

        System.out.println("Getting all files in " + dir.getCanonicalPath() + " including those in subdirectories");
        List<File> files = (List<File>) FileUtils.listFiles(dir, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE);
        for (File file : files) {
            System.out.println("file: " + file.getCanonicalPath());
        }
    }
}

Kyle
    Unless you can access the resources via a URL, you can't. – MadProgrammer Jun 14 '13 at 04:50
  • Well I can now, I edited the htaccess file to allow that. – Kyle Jun 14 '13 at 04:53
  • Also, it would be better using Apache HttpClient as shown here: [How to use java.net.URLConnection to fire and handle HTTP requests?](http://stackoverflow.com/q/2793150/1065197) – Luiggi Mendoza Jun 14 '13 at 04:53
  • So, first you need to read and parse the result of the first URL... – MadProgrammer Jun 14 '13 at 04:54
  • Thanks for the link. But, I don't see where it talks about getting an array of all the files it needs to download – Kyle Jun 14 '13 at 04:56
  • @Kyle - is this a specific Java exercise? If not, try something like curl -- http://curl.haxx.se/ – Jayan Jun 14 '13 at 05:01
  • Yes it must be done with java. And I updated my question with more code. – Kyle Jun 14 '13 at 05:09
  • What you have is a HTML file which just *happens* to list files. You need to parse this result, extract the individual links and then download those. This is the basic concept of the Web... – MadProgrammer Jun 14 '13 at 05:10

2 Answers

5

You need to download the page, which is the directory listing, parse it, and then download the individual files linked in the page...

You could do something like...

URL url = new URL("http://www.futureretrogaming.tk/gamefiles/ProfessorPhys");
InputStream is = null;
try {
    is = url.openStream();
    byte[] buffer = new byte[1024];
    int bytesRead = -1;
    StringBuilder page = new StringBuilder(1024);
    while ((bytesRead = is.read(buffer)) != -1) {
        page.append(new String(buffer, 0, bytesRead));
    }
    // Spend the rest of your life using String methods
    // to parse the result...
} catch (IOException ex) {
    ex.printStackTrace();
} finally {
    try {
        is.close();
    } catch (Exception e) {
    }
}

Or, you can download Jsoup and use it to do all the hard work...

try {
    Document doc = Jsoup.connect("http://www.futureretrogaming.tk/gamefiles/ProfessorPhys").get();
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        System.out.println(link.attr("href") + " - " + link.text());
    }
} catch (IOException ex) {
    ex.printStackTrace();
}

Which outputs...

?C=N;O=D - Name
?C=M;O=A - Last modified
?C=S;O=A - Size
?C=D;O=A - Description
/gamefiles/ - Parent Directory
Assembly-CSharp-Editor-firstpass-vs.csproj - Assembly-CSharp-Edit..>
Assembly-CSharp-Editor-firstpass.csproj - Assembly-CSharp-Edit..>
Assembly-CSharp-Editor-firstpass.pidb - Assembly-CSharp-Edit..>
Assembly-CSharp-firstpass-vs.csproj - Assembly-CSharp-firs..>
Assembly-CSharp-firstpass.csproj - Assembly-CSharp-firs..>
Assembly-CSharp-firstpass.pidb - Assembly-CSharp-firs..>
Assembly-CSharp-vs.csproj - Assembly-CSharp-vs.c..>
Assembly-CSharp.csproj - Assembly-CSharp.csproj
Assembly-CSharp.pidb - Assembly-CSharp.pidb
Assembly-UnityScript-Editor-firstpass-vs.unityproj - Assembly-UnityScript..>
Assembly-UnityScript-Editor-firstpass.pidb - Assembly-UnityScript..>
Assembly-UnityScript-Editor-firstpass.unityproj - Assembly-UnityScript..>
Assembly-UnityScript-firstpass-vs.unityproj - Assembly-UnityScript..>
Assembly-UnityScript-firstpass.pidb - Assembly-UnityScript..>
Assembly-UnityScript-firstpass.unityproj - Assembly-UnityScript..>
Assembly-UnityScript-vs.unityproj - Assembly-UnityScript..>
Assembly-UnityScript.pidb - Assembly-UnityScript..>
Assembly-UnityScript.unityproj - Assembly-UnityScript..>
Assets/ - Assets/
Library/ - Library/
Professor%20Phys-csharp.sln - Professor Phys-cshar..>
Professor%20Phys.exe - Professor Phys.exe
Professor%20Phys.sln - Professor Phys.sln
Professor%20Phys.userprefs - Professor Phys.userp..>
Professor%20Phys_Data/ - Professor Phys_Data/
Script.doc - Script.doc
~$Script.doc - ~$Script.doc
~WRL0392.tmp - ~WRL0392.tmp
~WRL1966.tmp - ~WRL1966.tmp

You would then need to build a new URL for each file and read as you have already done...

For example, the href for Assembly-CSharp-Edit..> is Assembly-CSharp-Editor-firstpass-vs.csproj, which appears to be a relative link, so you would need to prefix it with http://www.futureretrogaming.tk/gamefiles/ProfessorPhys to make a new URL of http://www.futureretrogaming.tk/gamefiles/ProfessorPhys/Assembly-CSharp-Editor-firstpass-vs.csproj

You would need to do this for each element you want to grab.
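Putting those pieces together, a rough end-to-end sketch might look like the following. Note this is only a sketch for this particular Apache-style listing: the base URL comes from the question, the crude regex stands in for Jsoup's parser (which is more robust), file names are assumed to contain no path separators, and `readAllBytes` needs Java 9+.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DirectoryDownloader {

    static final String BASE = "http://www.futureretrogaming.tk/gamefiles/ProfessorPhys/";

    // Pull href values out of the listing page. A crude regex stands in
    // for Jsoup here; real-world HTML is better handled by a parser.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            String href = m.group(1);
            // Skip the Apache sort links (?C=N;O=D etc.), the parent
            // directory and subdirectories - plain files only.
            if (!href.startsWith("?") && !href.startsWith("/") && !href.endsWith("/")) {
                hrefs.add(href);
            }
        }
        return hrefs;
    }

    // Download one file, reading until EOF instead of trusting available().
    // The file is saved under its URL-encoded name (e.g. Professor%20Phys.exe).
    static void download(String href, Path targetDir) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(BASE + href).openConnection();
        conn.addRequestProperty("User-Agent", "Mozilla/4.76");
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, targetDir.resolve(href), StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        Path targetDir = Paths.get(args.length > 0 ? args[0] : "ProfessorPhys");
        Files.createDirectories(targetDir);
        // Fetch the listing page itself, then download each linked file.
        HttpURLConnection conn = (HttpURLConnection) new URL(BASE).openConnection();
        conn.addRequestProperty("User-Agent", "Mozilla/4.76");
        String page;
        try (InputStream in = conn.getInputStream()) {
            page = new String(in.readAllBytes());
        }
        for (String href : extractHrefs(page)) {
            download(href, targetDir);
        }
    }
}
```

The filtering logic is the part most worth adapting; other servers (nginx, lighttpd) format their listings differently, so check what the actual page contains before relying on the `?`/`/` conventions above.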

MadProgrammer
  • Thanks. I will really look into this! – Kyle Jun 14 '13 at 05:50
  • You have to love blind downvotes. Please have the courage to provide some feedback so that we can all learn from the mistakes and have the opportunity to improve the answer – MadProgrammer Oct 28 '14 at 19:09
  • @MadProgrammer Sorry for the blind downvote, I should have explained: For a start, your top code would not work for whatever reason. I did not bother checking through everything, but it just wouldn't work. Secondly, the Jsoup added a dependency, which I personally think is unnecessary and messy. And also, you expected that output, but did not realize that the output is different with different web server applications (e.g. Apache / nginx / etc). In the end, I downloaded "download.txt", iterated through the strings in that file, and downloaded them. I just found your post unhelpful, is all. – Joehot200 Nov 01 '14 at 23:56
  • Also, it does not explain how to download subdirectories. For example, if I want to download "/home" and it has the subdirectory "/home/swag", that wouldn't be downloaded. Which is important to me, especially if I want to use natives. – Joehot200 Nov 01 '14 at 23:59
  • Why wouldn't the first section work? It was tested against the original question. – MadProgrammer Nov 02 '14 at 01:46
  • As you say, this is a particular solution for a given problem; it is impossible to produce a single solution that would work for every situation. The purpose of Jsoup was to reduce the complexity for the OP. I'd personally run it through a DOM and do it that way myself, as trying to write a substring solution would just be crazy talk :P. The intention was to provide a working example of the concept for a particular problem, not write the entire code for the OP; it's meant to provide ideas rather than solutions. In this particular case, the OP didn't have a download.txt file :( – MadProgrammer Nov 02 '14 at 01:52
  • First, thanks for the feedback. The example wasn't meant to be a standalone solution; the OP is going to need to do some work to fill the gaps, that's the point. Having parsed any number of websites over the years, I can tell you the only consistency is in the downloading of the URL; each one needs to be dealt with on a case-by-case basis. Personally I'd use Apache's HttpClient, but apparently you don't want a 3rd-party dependency to make your life easier; as I recall, that wasn't a restriction of the original question. It sounds like your needs are different from those of the OP – MadProgrammer Nov 02 '14 at 02:21
0

Have you considered a tool like HTTrack? It can detect the presence of anchor tags in HTML and download an entire website (limited by tree level). You can also specify filters for which files should be downloaded, etc.

If this doesn't suit your requirements, you can still use a hand-written Java program, except the problem is obtaining the list of files at the URL (and in all subfolders within). You need to parse the HTML, gather all the anchor tags, and traverse them (which is what HTTrack does).
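To illustrate that traversal, here is a hypothetical sketch. The `PageFetcher` interface is an assumption introduced here so the recursion can be exercised without a network, and the trailing-slash convention for subdirectory links matches Apache's default listing; other servers format listings differently, so the filtering would need adapting.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecursiveLister {

    // Swappable page source, so the traversal can be tested without a network.
    interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Walk an Apache-style listing: descend into subdirectory links
    // (those ending in '/') and collect absolute file URLs.
    static List<String> collectFiles(String baseUrl, PageFetcher fetcher) throws IOException {
        List<String> files = new ArrayList<>();
        Matcher m = HREF.matcher(fetcher.fetch(baseUrl));
        while (m.find()) {
            String href = m.group(1);
            if (href.startsWith("?") || href.startsWith("/")) {
                continue; // sort-order links and the parent directory
            }
            if (href.endsWith("/")) {
                files.addAll(collectFiles(baseUrl + href, fetcher)); // recurse
            } else {
                files.add(baseUrl + href);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // A real fetcher over HTTP (readAllBytes needs Java 9+).
        PageFetcher http = url -> {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes());
            }
        };
        collectFiles("http://www.futureretrogaming.tk/gamefiles/ProfessorPhys/", http)
                .forEach(System.out::println);
    }
}
```

Because the fetcher is injected, the "/home" with subdirectory "/home/swag" case from the comments above can be checked with canned HTML before pointing it at a live server.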

gerrytan