11

I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Is this a good way? The time I get looks too small compared to how long it actually takes before the table is displayed.

Yakov
  • 8,699
  • 25
  • 100
  • 182

5 Answers

16

To do it in the spark-shell (Scala), you can use spark.time().

See my other answer: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())

The output would be:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms

Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.
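
PySpark does not expose spark.time() (see the comments below), so for the asker's PySpark notebook a rough wall-clock equivalent could look like the sketch below. This is only an illustration, not part of the Spark API; spark_time is a made-up helper name and time.perf_counter() simply measures elapsed wall time around the action:

import time

def spark_time(action):
    # Hypothetical helper: time any callable, e.g. a DataFrame action such as df.show()
    start = time.perf_counter()
    result = action()
    print("Time taken: %.0f ms" % ((time.perf_counter() - start) * 1000))
    return result

df = sqlContext.sql(query)
spark_time(lambda: df.show())

Passing a lambda delays evaluation so that only the action itself is timed, which mirrors what spark.time() does in the Scala shell.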

Tyrone321
  • 923
  • 9
  • 19
  • 1
    Is there something I should include in my Jupyter setup to execute spark.time? It shows AttributeError: 'SparkSession' object has no attribute 'time'. (I am using PySpark; is this only available in the Scala version?) – E B Dec 17 '18 at 06:31
  • @EB I was using Scala on EMR. I don't know whether PySpark has `time()` – Tyrone321 Dec 22 '18 at 06:08
  • @Tyrone321 It doesn't. (Still) – lightsong Apr 08 '20 at 14:47
11

I use System.nanoTime wrapped in a helper function, like this:

// Evaluates the block f, prints the elapsed wall-clock time in ms, and returns f's result
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}

time {
  val df = sqlContext.sql(query)
  df.show()
}
shridharama
  • 800
  • 9
  • 17
7

Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and exact way I know of is to use the Spark History Server.

On Bluemix, in your notebooks, go to the "Palette" on the right side. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can investigate the executed Spark jobs, including computation times.
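
If you want the same numbers programmatically rather than through the UI, the History Server also exposes Spark's REST monitoring API. The sketch below is only an illustration: it assumes a self-managed History Server reachable on the default port 18080 and a known application id (history_server and app_id are placeholders, and the exact URL and field names may differ on the Bluemix service and across Spark versions):

import requests  # third-party HTTP client, assumed to be installed

history_server = "http://localhost:18080"  # placeholder; use your History Server URL
app_id = "app-00000000000000-0000"         # placeholder application id

# List the stages of one application; the reported times are in milliseconds
url = history_server + "/api/v1/applications/" + app_id + "/stages"
for stage in requests.get(url).json():
    print(stage["stageId"], stage["name"], stage.get("executorRunTime"))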

Sven Hafeneger
  • 801
  • 6
  • 12
  • I know the OP accepted the answer, but strangely enough it doesn't literally answer his question, i.e., using time.clock() to measure the query execution time. I had the same question, which is why I ended up here, but in the end there is no answer. – Nadjib Mami Oct 19 '16 at 07:53
  • @nadjib-mami Oops, good point, missed the simple "No" and went directly to the solution :) Thanks! – Sven Hafeneger Nov 07 '18 at 08:16
4

Spark itself provides quite granular information about each stage of your Spark job.

You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze the jobs at a later time.

Refer here for more info on the History Server.

Sumit
  • 1,362
  • 7
  • 9
  • 2
    The OP is asking about the Apache Spark Service on Bluemix, so not running their own spark cluster under their own control; e.g. it does not expose the ui on 4040. – Randy Horman Apr 29 '16 at 12:07
-1

For those looking for / needing a Python version (since a PySpark Google search leads to this post):

from time import time
from datetime import timedelta

class T():
    # Simple timing context manager: prints the elapsed wall-clock time on exit
    def __enter__(self):
        self.start = time()
        return self
    def __exit__(self, type, value, traceback):
        self.end = time()
        elapsed = self.end - self.start
        print(str(timedelta(seconds=elapsed)))

Usage:

with T():
    # Spark code goes here
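
For example, timing a full query end to end (the query and table name here are only placeholders):

with T():
    df = sqlContext.sql("SELECT * FROM some_table")  # placeholder query/table
    df.show()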

Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8

This proved useful when working from the console or with notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have objects shared across the notebook context).

SE_net4 the downvoter
  • 21,043
  • 11
  • 69
  • 107
Mehdi LAMRANI
  • 10,556
  • 13
  • 74
  • 115