3

Frankly, I'm not sure whether this feature exists, so apologies if it doesn't.

My requirement is to send Spark-analysed data to a file server on a daily basis. The file server supports file transfer through SFTP and through a REST web service POST call.

My initial thought was to save the Spark RDD to HDFS and transfer it to the file server through SFTP. I would like to know whether it is possible to upload the RDD directly by calling the REST service from the Spark driver class, without saving to HDFS. The size of the data is less than 2 MB.

Sorry for my bad English!

prakash

2 Answers

2

There is no specific way to do that with Spark. With that kind of data size it will not be worth it to go through HDFS or another type of storage. You can collect that data in your driver's memory and send it directly. For a POST call you can just use plain old java.net.URL, which would look something like this:

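import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import org.apache.spark.rdd.RDD

// The RDD you want to send (element type assumed to be String here)
val rdd: RDD[String] = ???

// Gather the data on the driver and join it into one newline-separated string
val body = rdd.collect().mkString("\n")

// Open a connection
val url = new URL("http://www.example.com/resource")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]

// Configure for a POST request
conn.setDoOutput(true)
conn.setRequestMethod("POST")

// Write the body, then close the stream
val os = conn.getOutputStream
os.write(body.getBytes(StandardCharsets.UTF_8))
os.close()

// Reading the response code completes the request
val status = conn.getResponseCode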

A much more complete discussion of using java.net.URL can be found at this question. You could also use a Scala library to handle the ugly Java stuff for you, like akka-http or Dispatch.
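If an extra dependency is not wanted, the JDK's own java.net.http.HttpClient (Java 11 and later) is another option. The following is only a sketch of that swapped-in approach, not part of the code above: the object name PostWithHttpClient, the method postText, and the text/plain content type are assumptions made for illustration, and the endpoint must be replaced with the file server's real URL.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object PostWithHttpClient {
  // Sends `body` as a UTF-8 text POST and returns the HTTP status code.
  // `endpoint` is a placeholder for the file server's real URL.
  def postText(endpoint: String, body: String): Int = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(endpoint))
      .header("Content-Type", "text/plain; charset=utf-8") // assumed content type
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    // Blocks until the server has answered
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    response.statusCode()
  }
}
```

On the driver this could be called as PostWithHttpClient.postText(endpoint, rdd.collect().mkString("\n")), which is reasonable only because the data is small enough to fit in driver memory.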

sgvd
  • I want to point out that you don't actually need to collect() the results to POST them to a web service. If you operate on the RDD itself, rather than on a collected set of results, each executor makes the web service calls for its own partitions. Issuing the requests in parallel like this may be desirable in some cases. – AssHat_ Feb 25 '16 at 04:54
  • Well, in my case I can't collect the RDD; otherwise I get an OutOfMemoryError. – Dennis Glot Aug 22 '19 at 13:51
0

Spark itself does not provide this functionality (it is not a general-purpose HTTP client). You might consider using an existing REST client library such as akka-http or spray, or some other Java/Scala client library.

That said, you are by no means obliged to save your data to disk before operating on it. You could, for example, use the collect() or foreach methods on your RDD in combination with your REST client library.
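As a rough sketch of the second option, the helper below POSTs one batch of lines using only the JDK, so it can run on the driver or inside an executor. The names PartitionPoster and postLines, the text/plain content type, and the endpoint value are assumptions for illustration; the Spark calls are shown only as comments because they need Spark on the classpath.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object PartitionPoster {
  // POSTs the given lines as one newline-separated body and returns the HTTP status code.
  def postLines(endpoint: String, lines: Iterator[String]): Int = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setDoOutput(true)
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8") // assumed content type
      val os = conn.getOutputStream
      try os.write(lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
      finally os.close()
      conn.getResponseCode // the request has been sent by the time this returns
    } finally conn.disconnect()
  }
}

// Hypothetical use from a Spark job, one request per partition, sent by the executors:
//   rdd.foreachPartition(part => PartitionPoster.postLines(endpoint, part.map(_.toString)))
// Or, for small data, a single request from the driver:
//   PartitionPoster.postLines(endpoint, rdd.collect().iterator.map(_.toString))
```

Note that the per-partition variant sends a request even for an empty partition, and the server receives several independent uploads rather than one file, so for data under 2 MB the single request from the driver is probably the simpler choice.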

Jakob Odersky