I run Nutch on my Hadoop cluster. When the job reaches the fetch step, I get java.net.SocketException: Connection reset. Here is the full stack trace:

2013-10-09 00:34:05,922 INFO org.apache.nutch.fetcher.Fetcher: fetch of Url error : xxxxxxx  failed with: java.net.SocketException: Connection reset
2013-10-09 00:34:05,923 ERROR org.apache.nutch.protocol.httpclient.Http: Failed to get protocol output
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:189)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
harpun
cldo
Can you access the target URL (possibly several times in a row) with your browser, curl, or wget from the host that Hadoop is running on? See http://stackoverflow.com/questions/62929/java-net-socketexception-connection-reset for an explanation of the exception itself. – harpun Oct 08 '13 at 18:35

1 Answer

You have to specify the protocol of every URL in your seed list. For example:

http://stackoverflow.com/
https://google.com
ftp://foo.bar
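
If the seed list is long, entries without a protocol can be hard to spot by eye. The following is a minimal sketch, not part of Nutch, that flags such entries using only java.net.URI. The class name SeedListCheck and both method names are hypothetical, and skipping blank lines and lines starting with # is an assumption about how the seed file is laid out. It only checks that a protocol and host are present; it does not test whether the server is reachable.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Nutch class: checks seed entries for a protocol.
final class SeedListCheck {

    // True if the entry parses as a URI that has both a scheme and a host,
    // e.g. "http://stackoverflow.com/". A bare "stackoverflow.com" has neither.
    static boolean hasProtocol(String entry) {
        try {
            URI uri = new URI(entry.trim());
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Returns the entries that lack a protocol, in their original order.
    // Assumption: blank lines and lines starting with '#' are not seed URLs.
    static List<String> missingProtocol(List<String> seedLines) {
        List<String> bad = new ArrayList<>();
        for (String line : seedLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            if (!hasProtocol(trimmed)) {
                bad.add(trimmed);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
                "http://stackoverflow.com/",
                "https://google.com",
                "ftp://foo.bar",
                "stackoverflow.com"); // no protocol, should be reported
        for (String entry : missingProtocol(seeds)) {
            System.out.println("Missing protocol: " + entry);
        }
    }
}
```

In practice you would read the lines from your own seed file and fix any reported entry by prefixing it with the intended protocol before injecting.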