How do I collect these metrics on the console (in the Spark shell or a spark-submit job) right after the task or job is done?

We are using Spark to load data from MySQL into Cassandra, and the dataset is quite large (e.g., ~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or in a spark-submit job?

Sample command to load from MySQL into Cassandra:

val pt = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:mysql://...:3306/...")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "payment_types")
  .option("user", "hadoop")
  .option("password", "...")
  .load()

pt.save("org.apache.spark.sql.cassandra", SaveMode.Overwrite,
  options = Map("table" -> "payment_types", "keyspace" -> "test"))

I want to retrieve all the Spark UI metrics for the above task, mainly Output Size and Records Written.

Please help.

Thanks for your time!

Ajay Guyyala
  • You mean you can find the metrics on the Spark UI? I didn't find them with similar code (reading a JDBC source). Where do the metrics show up on the UI? – Tom Sep 04 '18 at 09:24
  • They show up on Spark's application UI, usually under Jobs and under Stages. You can see stats, executor info, and per-task info, like how much data each task reads and how much shuffle output each task writes. – Ajay Guyyala Sep 05 '18 at 13:33
  • Thanks @ajay-guyyala. I've had no luck seeing it on the UI. I will investigate what's happening. – Tom Sep 06 '18 at 00:58
  • Here are some sample images I found. The UI may not show metrics for all jobs/stages, and it depends on which Spark version you are using. At the time I posted this question we were using Spark 1.5.x or 1.6.x. https://community.hortonworks.com/questions/67659/what-are-the-important-metrics-to-notice-for-each.html – Ajay Guyyala Sep 07 '18 at 01:14
  • @AjayGuyyala Were you able to get this data from the Spark UI? I have the same requirement, where I have to fetch some useful data from the Spark UI into my Java code. – Akash Patel Oct 10 '20 at 17:18

1 Answer

Found the answer. You can get the stats by using SparkListener.

If your job has no input or output metrics, you might get None.get exceptions; guard against them with an if statement (or an isDefined check) as shown below.

var inputRecords = 0L
var outputWritten = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    // On Spark 1.x, inputMetrics/outputMetrics are Options, so check before .get
    if (metrics.inputMetrics.isDefined) {
      inputRecords += metrics.inputMetrics.get.recordsRead
    }
    if (metrics.outputMetrics.isDefined) {
      outputWritten += metrics.outputMetrics.get.recordsWritten
    }
  }
})
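
As an aside, on Spark 2.x and later taskMetrics.inputMetrics and taskMetrics.outputMetrics return plain metric objects rather than Options, so the Option handling goes away. A minimal sketch, assuming the 2.x-style API:

var inputRecords = 0L
var outputWritten = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) { // taskMetrics can be null when a task fails
      inputRecords += metrics.inputMetrics.recordsRead
      outputWritten += metrics.outputMetrics.recordsWritten
    }
  }
})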

Please find a full example below.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "...")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("spark.master", "spark://....:7077")
  .set("spark.driver.memory", "1g")
  .set("spark.executor.memory", "10g")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.2")
  .setAppName("CassandraTest")

// Stop the shell's default context before creating one with our own conf
sc.stop()
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)

var inputRecords = 0L
var outputWritten = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    // Guard the Options so tasks without input/output metrics don't throw
    if (metrics.inputMetrics.isDefined) {
      inputRecords += metrics.inputMetrics.get.recordsRead
    }
    if (metrics.outputMetrics.isDefined) {
      outputWritten += metrics.outputMetrics.get.recordsWritten
    }
  }
})

val bp = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:mysql://...:3306/...")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "bucks_payments")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "14596")
  .option("numPartitions", "10")
  .option("fetchSize", "100000")
  .option("user", "hadoop")
  .option("password", "...")
  .load()

bp.save("org.apache.spark.sql.cassandra", SaveMode.Overwrite,
  options = Map("table" -> "bucks_payments", "keyspace" -> "test"))

println("outputWritten",outputWritten)

Result:

scala> println("outputWritten",outputWritten)
(outputWritten,16383)
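
One caveat: the counters keep accumulating across every job run on this SparkContext, so if you want per-load numbers, reset them before each load. For example:

// Reset so the next load reports only its own totals
inputRecords = 0L
outputWritten = 0L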
user3190018
Ajay Guyyala