I have the following Java code to read the content of a web page:

URL url = new URL(urlToParse);
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));

urlToParse is passed as a parameter to this function and is equal to "http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03".
The code comes from here.
The output is gibberish, full of question marks and unknown characters.

I tried adding these 5 lines after the openConnection() line:

con.setRequestMethod("GET");
con.setDoOutput(true);
con.setReadTimeout(2000);
con.setChunkedStreamingMode(0);
con.connect();  

as suggested in the solution offered here, but then I get this exception:
Exception in thread "main" java.io.FileNotFoundException: http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1835)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
It is thrown by the line InputStream is = con.getInputStream();

Pasting this link into the browser takes me to the resource, so the URL itself cannot be invalid, yet calling con.getResponseCode() returns 404.

When I read the error from getErrorStream(), it prints this:

<!DOCTYPE html>
<html>
    <head>
    <title>The resource cannot be found.</title>
    <meta name="viewport" content="width=device-width" />
    <style>
     body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
     p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
     b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
     H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
     H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
     pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
     .marker {font-weight: bold; color: black;text-decoration: none;}
     .version {color: gray;}
     .error {margin-bottom: 10px;}
     .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
     @media screen and (max-width: 639px) {
      pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
     }
     @media screen and (max-width: 479px) {
      pre { width: 280px; }
     }
    </style>
</head>

<body bgcolor="white">

        <span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>

        <h2> <i>The resource cannot be found.</i> </h2></span>

        <font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">

        <b> Description: </b>HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. &nbsp;Please review the following URL and make sure that it is spelled correctly.
        <br><br>

        <b> Requested URL: </b>/file/download/<br><br>

        <hr width=100% size=1 color=silver>

        <b>Version Information:</b>&nbsp;Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.0.30319.34248

        </font>

</body>  

 HttpException:  A public action method &#39;download&#39; was not found on controller     &#39;SwissTiming.DocMgmt.DMSWeb.Controllers.FileController&#39;.
at System.Web.Mvc.Controller.HandleUnknownAction(String actionName)
at System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.<BeginExecute>b__15(IAsyncResult asyncResult, Controller controller)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.EndProcessRequest(IAsyncResult result)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
--><!-- 
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using &lt;customErrors mode="Off"/&gt;. Consider using &lt;customErrors mode="On"/&gt; or &lt;customErrors mode="RemoteOnly"/&gt; in production environments.-->  

That is basically where I am stuck, and I cannot understand the problem at all. I don't even know where the ASP.NET comes from.
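
For reference, a minimal sketch of how the status code and the error body above can be read. getInputStream() throws for 4xx and 5xx responses, so the body has to come from getErrorStream() instead. The names ErrorBodyReader and readBody are made up for illustration, and UTF-8 is assumed for the body:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ErrorBodyReader {

    /** Returns the response body as text, whether the request succeeded or failed. */
    static String readBody(HttpURLConnection con) throws IOException {
        int status = con.getResponseCode();
        // For 4xx/5xx responses the body is only available on the error stream.
        InputStream in = status >= 400 ? con.getErrorStream() : con.getInputStream();
        if (in == null) {
            return ""; // the server sent no body at all
        }
        try (InputStream body = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Assumption: the body is UTF-8 text.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL to inspect as the first command-line argument.
        HttpURLConnection con = (HttpURLConnection) new URL(args[0]).openConnection();
        System.out.println(con.getResponseCode());
        System.out.println(readBody(con));
    }
}
```

Casting to HttpURLConnection is what makes getResponseCode() and getErrorStream() available; the plain URLConnection type does not declare them.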

Other attempts to work around the problem that did not solve it:
1. Adding
httpConnection.setRequestProperty("User-Agent","Mozilla/5.0 ( compatible ) ");
httpConnection.setRequestProperty("Accept","*/*");
as suggested here. I also tried using the user agent from this, as suggested here.
I still get the FileNotFoundException in getInputStream().
2. Adding System.setProperty("http.agent", ""); as mentioned here.
3. Back to the original problem (printing gibberish): I tried changing the InputStreamReader call this way:
new InputStreamReader(new URL("www.website.com").openStream(), "UTF-8") as mentioned in the comment here, but it didn't change anything.
4. Adding the lines:
con.setRequestMethod("POST"); con.setDoInput(true);
I still get the FileNotFoundException.
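
Attempt 3 in the list above passes a charset to InputStreamReader. A minimal sketch of that pattern against an in-memory stream, assuming the content really is UTF-8 text (Utf8Text and readAll are illustrative names). An explicit charset only helps when the bytes are text; it cannot make binary content readable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Text {

    /** Decodes the whole stream as UTF-8 and returns it as a String. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Naming the charset avoids depending on the platform default.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "h\u00e9llo" contains a non-ASCII character encoded as two bytes in UTF-8.
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8)));
    }
}
```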

I'm pretty confused.

I'm not even sure whether I have an encoding problem (before I tried to fix things by adding settings to the connection, there was no exception, "just" wrong output).
Or maybe I have some other problem with the connection that prevents me from getting input from it (and if so, what is special about this specific website? The pages that led me to it, e.g. http://www.omegatiming.com/Competition?id=00010F0200FFFFFFFFFFFFFFFFFFFFFF&sport=AQ&year=2015, could be parsed without a problem).

[here][1]: Using Java to pull data from a webpage?
[here][2]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[here][3]: URLConnection FileNotFoundException for non-standard HTTP port sources
[here][4]: Setting "User-Agent" parameters for URLConnection for querying Google from a Java application
[here][5]: Setting user agent of a java URLConnection
[here][6]: Trying to read from a URL(in Java) produces gibberish on certain occaisions

[this][1]: http://www.whatsmyuseragent.com/


1 Answer

I managed to bypass the need to parse the file directly from the web.

I got pdfbox by adding the dependencies listed here to my pom.xml and running mvn clean install.
Then I downloaded the file to my PC, using the information in this post.
Then (now that I have pdfbox) I added these 3 lines:

 PDDocument pdf = PDDocument.load(new File("sample.pdf"));
 PDFTextStripper stripper = new PDFTextStripper();
 String plainText = stripper.getText(pdf);

as mentioned here.
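
The gibberish in the question seems to come from the response being a PDF rather than HTML, which is why pdfbox helps here. A minimal sketch for checking that before parsing, assuming the first bytes of the download are at hand. PDF files start with the header "%PDF-"; PdfSniffer and looksLikePdf are illustrative names and not part of pdfbox:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of pdfbox.
final class PdfSniffer {

    // Every PDF file begins with this ASCII header, followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the PDF header bytes. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));
        System.out.println(looksLikePdf(htmlStart));
    }
}
```

A check like this only says which parser to hand the bytes to; the text extraction itself still goes through PDFTextStripper as above.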

It's not a perfect solution. It uses disk space on my PC to store the files (it may be possible to keep only one file and delete it each time; I still haven't checked that), and it may use too much of the program's memory, since getText() has to parse the whole file. But it solves my issue, which is how to parse this specific website; all my program needs from it is the extracted text.
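
On the point about keeping only one file and deleting it each time, here is an untested-against-the-real-site sketch using only the standard library. It copies a stream to a temporary file, runs some action on it, and always deletes the file afterwards. TempDownload, withTempCopy and FileAction are made-up names; in the real program the stream would come from the connection and the action would be the pdfbox parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of pdfbox or the JDK.
final class TempDownload {

    /** Work to perform on the temporary file; may throw IOException. */
    interface FileAction<T> {
        T apply(Path file) throws IOException;
    }

    /** Copies the stream to a temp file, runs the action, then deletes the file. */
    static <T> T withTempCopy(InputStream in, FileAction<T> action) throws IOException {
        Path tmp = Files.createTempFile("download-", ".pdf");
        try {
            // createTempFile already created an empty file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            // Runs on success and on failure, so no files accumulate on disk.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real download: three arbitrary bytes.
        InputStream fake = new ByteArrayInputStream(new byte[] {1, 2, 3});
        Long size = withTempCopy(fake, Files::size);
        System.out.println(size);
    }
}
```

This addresses only the disk usage; getText() would still hold the whole extracted text in memory.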

[here][1]: http://pdfbox.apache.org/2.0/getting-started.html
[here][2]: http://blog.e-zest.net/extracting-text-from-a-pdf-file/

[this][1]: How to download a PDF from a given URL in Java?
