
I am trying to load a (very big) serialized RDD of objects into the memory of a cluster of EC2 nodes, then extract individual objects from them and store the resulting RDD on disk (as object files). Unfortunately I get SocketException: Connection reset once and SocketTimeoutException: Read timed out a few times.

Here is the relevant part of my code:

val pairsLocation = args(0)
val pairsRDD = sc.objectFile[Pair](pairsLocation)
// extract the individual objects from each "Pair" (each Pair contains two of those simple objects)
val extracted = pairsRDD.filter(myFunc(_._1)).
      flatMap(x => List(x._1, x._2)).distinct
val savePath = "s3 URI"
extracted.saveAsObjectFile(savePath)
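
The Pair class and myFunc are not shown above; purely as a hypothetical sketch of the shapes the snippet assumes (names and fields are made up):

// Hypothetical shapes only; the actual Pair class and predicate are not part of the question.
case class Item(id: Long, payload: String)          // one of the "simple objects"
case class Pair(_1: Item, _2: Item)                 // serializable container holding two Items
def myFunc(item: Item): Boolean = item.id % 2 == 0  // example predicate used in the filter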

Here are the details of the errors (warnings) I get:

15/03/12 18:40:27 WARN scheduler.TaskSetManager: Lost task 574.0 in stage 0.0 (TID 574, ip-10-45-14-27.us-west-2.compute.internal): 
java.net.SocketException: Connection reset
  at java.net.SocketInputStream.read(SocketInputStream.java:196)
  at java.net.SocketInputStream.read(SocketInputStream.java:122)
  at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
  at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
  at sun.security.ssl.InputRecord.read(InputRecord.java:509)
  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
  at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
  at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
  at java.io.FilterInputStream.read(FilterInputStream.java:133)
  at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
  at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
  at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
  at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:98)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at java.io.DataInputStream.readFully(DataInputStream.java:195)
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1988)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2120)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:244)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:210)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)


15/03/12 18:42:16 WARN scheduler.TaskSetManager: Lost task 380.0 in stage 0.0 (TID 380, ip-10-47-3-111.us-west-2.compute.internal):
 java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:152)
  at java.net.SocketInputStream.read(SocketInputStream.java:122)
  at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
  at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
  at sun.security.ssl.InputRecord.read(InputRecord.java:509)
  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
  at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
  at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
  at java.io.FilterInputStream.read(FilterInputStream.java:133)
  at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
  at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
  at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
  at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:98)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
  at java.io.DataInputStream.readFully(DataInputStream.java:195)
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1988)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2120)
  at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:244)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:210)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
  • What level of access does the Apache Spark project provide for the HttpClient that appears to be called from the Apache Hadoop or Jets3t code? My thought is that something ought to let you specify the HttpClient connection parameters somewhere. That way, you could better configure your connections to the server you want to hit. However, the server you're connecting to could have its own timeouts, such that this might never work if the objects are bigger than it can feed to you before it times out. – n0741337 Mar 12 '15 at 23:05
  • @Metallica did you find a solution? I get a timeout when I read from ElasticSearch (via ES Hadoop APIs) into Spark RDDs after about 10 minutes. – Adrian Mar 26 '15 at 18:12
    @Adrian I increased the number of CPU cores per Spark task, so that each task finishes faster (before the socket times out) and fewer partitions are downloaded at the same time. For example, if you have 16 cores and you increase the number of cores per task from 2 to 4, only 4 partitions get downloaded to each worker at a time. I also increased Akka's timeout threshold. However, I still get some of those errors and I have some data loss, but it's not significant. – Pragmatic geek Mar 30 '15 at 16:10
  • Here is the script I use to run my class with Spark (a programmatic equivalent is sketched after these comments): `/root/spark/bin/spark-submit --driver-memory 8g --conf spark.akka.frameSize=100 --conf spark.task.cpus=4 --conf spark.akka.timeout=200 --class "$class" --master "$master" "$jar" $@ 1>"$class".log 2>"$class".err` – Pragmatic geek Mar 30 '15 at 16:12
  • @Metallica thanks so much! – Adrian Mar 31 '15 at 00:43
  • Thank you for providing the settings in command format. Well done. – deepelement Jan 24 '17 at 16:18
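
For reference, the settings in the spark-submit command above could also be set programmatically through SparkConf; this is a minimal sketch assuming Spark 1.x, and the application name is made up:

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of --conf spark.task.cpus=4 --conf spark.akka.frameSize=100 --conf spark.akka.timeout=200
val conf = new SparkConf()
  .setAppName("PairExtraction")        // hypothetical application name
  .set("spark.task.cpus", "4")         // more cores per task => fewer partitions read concurrently
  .set("spark.akka.frameSize", "100")  // Akka frame size in MB
  .set("spark.akka.timeout", "200")    // Akka timeout in seconds
val sc = new SparkContext(conf)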
