9

I have a Cassandra database from which I analyzed the data using Spark SQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Are there any ways to achieve this directly, apart from using the PostgreSQL driver? (I achieved it using PostgREST and the driver; I want to know whether there are any methods like saveToCassandra().)

George Kagan
Justin

4 Answers

13

At the moment there is no native implementation for writing an RDD to any DBMS. Here are links to the related discussions on the Spark user list: one, two

In general, the most performant approach would be the following:

  1. Validate the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine: if the number is lower, call repartition with 20 partitions; if it is higher, call coalesce down to 50 partitions
  2. Call the mapPartitions transformation and, inside it, call a function that inserts the records into your DBMS over JDBC. In this function, open the connection to your database and use the COPY command with this API; it eliminates the need for a separate command for each record, so the insert is processed much faster

This way you would insert the data into Postgres in parallel, utilizing up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach could be implemented as a Java/Scala function accepting the RDD and the connection string; a minimal sketch follows.
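A minimal Scala sketch of that approach, assuming the PostgreSQL JDBC driver is on the classpath, that each record has already been rendered as a single CSV line, and that the events table and JDBC URL are placeholders:

import java.io.StringReader
import java.sql.DriverManager

import org.apache.spark.rdd.RDD
import org.postgresql.PGConnection

// Sketch only: the "events" table, CSV-formatted records, and the JDBC URL are assumptions.
def saveViaCopy(rdd: RDD[String], url: String): Unit = {
  // Keep the number of partitions (and therefore parallel connections) in the 20-50 range.
  val numPartitions = rdd.partitions.length
  val bounded =
    if (numPartitions < 20) rdd.repartition(20)
    else if (numPartitions > 50) rdd.coalesce(50)
    else rdd

  val rowsWritten = bounded.mapPartitions { partition =>
    // One connection and one COPY per partition instead of one INSERT per record.
    val conn = DriverManager.getConnection(url)
    try {
      val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI
      // For simplicity the partition is buffered in memory; each element is one CSV line.
      val csv = partition.mkString("\n")
      Iterator(copyApi.copyIn("COPY events FROM STDIN WITH CSV", new StringReader(csv)))
    } finally {
      conn.close()
    }
  }.sum()   // sum() is the action that triggers the write and totals the copied rows

  println(s"Copied $rowsWritten rows into Postgres")
}

Each partition opens one connection and issues one COPY, so the level of parallelism is bounded by the partition count set above.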

0x0FFF
2

You can use the Postgres COPY API to write the data; it is much faster that way. The following two methods illustrate this: the first iterates over the RDD to fill a buffer that is then saved by the COPY API. The only thing you have to take care of is building a correct CSV-formatted payload for the COPY command to consume.

import java.io.StringReader
import java.sql.SQLException

import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.postgresql.PGConnection

// EventModel.createCSV, connectionPool and logWarning are this application's own helpers.
def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
    val sb = mutable.StringBuilder.newBuilder
    val now = System.currentTimeMillis()

    // Collect the records on the driver and render each one as a CSV line.
    rdd.collect().foreach { itr =>
        itr.foreach(_.createCSV(sb, now).append("\n"))
    }

    // "statement" is a placeholder for an optional column list, e.g. "(col1, col2)".
    copyIn("myTable", new StringReader(sb.toString), "statement")
    sb.clear()
}

def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = ""): Unit = {
    val conn = connectionPool.getConnection()
    try {
        // Stream the CSV buffer into Postgres as a single COPY command.
        conn.unwrap(classOf[PGConnection]).getCopyAPI
            .copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
    } catch {
        case se: SQLException => logWarning(se.getMessage)
        case t: Throwable => logWarning(t.getMessage)
    } finally {
        conn.close()
    }
}
nont
smishra
  • Won't the sb StringBuilder buffer grow without bound with the number of records in the EventModel RDD? Why won't you run out of memory? – nont Aug 25 '15 at 20:08
  • I have been using this for my solution that has been running for months now and I have not seen it going out of memory so far. The volume of data that I have is quite sizeable too - 100000/sec. Moreover, if you are concerned about this, you can always have another check based on which you call copyIn and clear the buffer. – smishra Sep 09 '15 at 20:19
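A rough sketch of the chunked variant mentioned in the comment above, reusing the same EventModel, createCSV and copyIn helpers from the answer; the flush threshold is arbitrary:

// Sketch only: flushes the CSV buffer every `flushEvery` records so it never grows unbounded.
def saveToDBChunked(rdd: RDD[Iterable[EventModel]], flushEvery: Int = 10000): Unit = {
    val sb = mutable.StringBuilder.newBuilder
    val now = System.currentTimeMillis()
    var pending = 0

    def flush(): Unit = {
        copyIn("myTable", new StringReader(sb.toString), "statement")
        sb.clear()
        pending = 0
    }

    // toLocalIterator pulls one partition at a time to the driver instead of collect()-ing all of them.
    rdd.toLocalIterator.foreach { itr =>
        itr.foreach { event =>
            event.createCSV(sb, now).append("\n")
            pending += 1
            if (pending >= flushEvery) flush()
        }
    }
    if (pending > 0) flush()
}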
1

The answer by 0x0FFF is good. Here is an additional point that may be useful.

I use foreachPartition to persist to an external store. This is also in line with the "Design Patterns for using foreachRDD" section of the Spark documentation: https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams

Example:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
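For Postgres specifically, the connection.send(record) placeholder above would typically become a batched JDBC insert; a rough sketch, assuming a hypothetical events(id, payload) table, a (String, String) record type, and a plain DriverManager connection in place of the pool:

import java.sql.DriverManager

import org.apache.spark.rdd.RDD

// Sketch only: the events(id, payload) table, the JDBC URL, and the record type are assumptions.
def writeToPostgres(rdd: RDD[(String, String)], url: String): Unit = {
  rdd.foreachPartition { records =>
    // One connection per partition; a real ConnectionPool as above would avoid reopening it.
    val conn = DriverManager.getConnection(url)
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
    try {
      records.foreach { case (id, payload) =>
        stmt.setString(1, id)
        stmt.setString(2, payload)
        stmt.addBatch()   // batch the inserts instead of a round trip per record
      }
      stmt.executeBatch()
      conn.commit()
    } finally {
      stmt.close()
      conn.close()
    }
  }
}

Batching keeps the per-record round trips down while still opening only one connection per partition; for very large bulk loads the COPY-based approach above is typically faster.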
jsr
1

The answers above refer to older Spark versions. In Spark 2.x there is a JDBC connector that enables writing directly to an RDBMS from a DataFrame.

Example (Python API):

jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
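
For the Scala API, a roughly equivalent sketch, assuming a DataFrame named df and placeholder credentials:

import java.util.Properties

// Sketch only: df, the JDBC URL, table name and credentials are placeholders.
val props = new Properties()
props.setProperty("user", "username")
props.setProperty("password", "password")
props.setProperty("driver", "org.postgresql.Driver")

// Append the DataFrame's rows to schema.tablename over JDBC.
df.write
  .mode("append")
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", props)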

RELW