11

I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Is this a good way? The time I get looks too small compared to how long it actually takes before the table is displayed.

Yakov
  • 8,699
  • 25
  • 100
  • 182

5 Answers

16

To do it in the spark-shell (Scala), you can use spark.time().

See my other answer: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())

The output would be:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms

Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.
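
PySpark does not expose spark.time() (see the comments below), so for the asker's PySpark notebook a rough wall-clock equivalent could look like the sketch below. This is only an illustration, not part of the Spark API; spark_time is a made-up helper name and time.perf_counter() simply measures elapsed wall time around the action:

import time

def spark_time(action):
    # Hypothetical helper: time any callable, e.g. a DataFrame action such as df.show()
    start = time.perf_counter()
    result = action()
    print("Time taken: %.0f ms" % ((time.perf_counter() - start) * 1000))
    return result

df = sqlContext.sql(query)
spark_time(lambda: df.show())

Passing a lambda delays evaluation so that only the action itself is timed, which mirrors what spark.time() does in the Scala shell.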

Tyrone321
  • 923
  • 9
  • 19
  • 1
    Is there something I should include in my Jupyter setup to execute spark.time? It shows AttributeError: 'SparkSession' object has no attribute 'time'. (I am using PySpark; is this only available in the Scala version?) – E B Dec 17 '18 at 06:31
  • @EB I was using Scala on EMR. I don't know whether PySpark has `time()` – Tyrone321 Dec 22 '18 at 06:08
  • @Tyrone321 It doesn't. (Still) – lightsong Apr 08 '20 at 14:47
11

I use System.nanoTime wrapped in a helper function, like this:

// Evaluates the block f, prints the elapsed wall-clock time in ms, and returns f's result
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}

time {
  val df = sqlContext.sql(query)
  df.show()
}
shridharama
  • 800
  • 9
  • 17
7

Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and exact way I know of is to use the Spark History Server.

On Bluemix, in your notebooks, go to the "Palette" on the right side. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can investigate the executed Spark jobs, including computation times.
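
If you want the same numbers programmatically rather than through the UI, the History Server also exposes Spark's REST monitoring API. The sketch below is only an illustration: it assumes a self-managed History Server reachable on the default port 18080 and a known application id (history_server and app_id are placeholders, and the exact URL and field names may differ on the Bluemix service and across Spark versions):

import requests  # third-party HTTP client, assumed to be installed

history_server = "http://localhost:18080"  # placeholder; use your History Server URL
app_id = "app-00000000000000-0000"         # placeholder application id

# List the stages of one application; the reported times are in milliseconds
url = history_server + "/api/v1/applications/" + app_id + "/stages"
for stage in requests.get(url).json():
    print(stage["stageId"], stage["name"], stage.get("executorRunTime"))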

Sven Hafeneger
  • 801
  • 6
  • 12
  • I know the OP accepted the answer, but strangely enough it doesn't literally answer his question, i.e., using time.clock() to measure the query execution time. I had the same question, which is why I ended up here, but in the end there is no answer. – Nadjib Mami Oct 19 '16 at 07:53
  • @nadjib-mami Oops, good point, missed the simple "No" and went directly to the solution :) Thanks! – Sven Hafeneger Nov 07 '18 at 08:16
4

Spark itself provides quite granular information about each stage of your Spark job.

You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze the jobs at a later time.

Refer here for more info on the History Server.

Sumit
  • 1,362
  • 7
  • 9
  • 2
    The OP is asking about the Apache Spark Service on Bluemix, so not running their own spark cluster under their own control; e.g. it does not expose the ui on 4040. – Randy Horman Apr 29 '16 at 12:07
-1

For those looking for / needing a Python version (since a PySpark Google search leads to this post):

from time import time
from datetime import timedelta

class T():
    # Simple timing context manager: prints the elapsed wall-clock time on exit
    def __enter__(self):
        self.start = time()
        return self
    def __exit__(self, type, value, traceback):
        self.end = time()
        elapsed = self.end - self.start
        print(str(timedelta(seconds=elapsed)))

Usage:

with T():
    # Spark code goes here
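
For example, timing a full query end to end (the query and table name here are only placeholders):

with T():
    df = sqlContext.sql("SELECT * FROM some_table")  # placeholder query/table
    df.show()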

Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8

This proved useful when working from the console or with notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have objects shared across the notebook context).

SE_net4 the downvoter
  • 21,043
  • 11
  • 69
  • 107
Mehdi LAMRANI
  • 10,556
  • 13
  • 74
  • 115