
I am using Sparklyr to run a Spark application in local mode on a virtual machine with 244 GB of RAM. In my code I use spark_read_csv() to read in ~50 MB of CSVs from one folder and then ~1.5 GB of CSVs from a second folder. My issue is that the application throws an error when trying to read in the second folder.
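For reference, the reads look roughly like this (a minimal sketch; the table names and folder paths are placeholders, not my actual paths):

library(sparklyr)

sc <- spark_connect(master = "local")

# the ~50 MB folder reads without problems
small_tbl <- spark_read_csv(sc, name = "small_data", path = "file:///data/small_csvs/")

# the ~1.5 GB folder is where the error is thrown
large_tbl <- spark_read_csv(sc, name = "large_data", path = "file:///data/large_csvs/")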

As I understand it, the issue is that the default RAM available to the driver JVM is 512MB - too small for the second folder (in local mode all operations are run within the driver JVM, as described in How to set Apache Spark Executor memory). So I need to increase the spark.driver.memory parameter to something larger.

The issue is that I cannot set this parameter through the normal methods described in the sparklyr documentation (i.e. via spark_config(), the config.yml file, or the spark-defaults.conf file):

in local mode, by the time you run spark-submit, a JVM has already been launched with the default memory settings, so setting "spark.driver.memory" in your conf won't actually do anything for you. Instead, you need to run spark-submit as follows:

bin/spark-submit --driver-memory 2g --class your.class.here app.jar

(from How to set Apache Spark Executor memory).
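Concretely, this is the sort of thing I had been trying without success (a minimal sketch; the 5G value is illustrative):

library(sparklyr)

config <- spark_config()
config$spark.driver.memory <- "5G"   # this setting did not take effect for me in local mode
sc <- spark_connect(master = "local", config = config)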

I thought I could replicate the bin/spark-submit command above by adding the sparklyr.shell.driver-memory option to the config.yml. As stated in the Sparklyr documentation, sparklyr.shell* options are command line parameters that get passed to spark-submit, so adding sparklyr.shell.driver-memory: 5G to the config.yml file should be equivalent to running bin/spark-submit --driver-memory 5G.
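In other words, I tried a config.yml along these lines (a sketch, assuming the standard default: section that sparklyr config files use):

default:
  sparklyr.shell.driver-memory: 5G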

I have now tried all of the above options and none of them change driver memory in the Spark application (which I check by looking at the 'Executors' tab of the Spark UI).

So how can I change driver memory when running Spark in local mode via Sparklyr?

jay

2 Answers


Thanks for the suggestions @Aydin K. Ultimately I was able to configure driver memory by first updating Java to 64-bit (which allows the JVMs to use more than 4 GB of RAM), then setting the sparklyr.shell* parameters within the spark_config() object:

library(sparklyr)

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '30G'    # passed to spark-submit as --driver-memory 30G
config$`sparklyr.shell.executor-memory` <- '30G'  # passed to spark-submit as --executor-memory 30G
sc <- spark_connect(master = 'local', version = '2.0.1', config = config)
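To confirm the setting took effect, spark_web() opens the Spark UI, where the new driver memory is visible under the Executors tab (the read below is a sketch with a placeholder path):

# open the Spark UI and check the Executors tab
spark_web(sc)

# the ~1.5 GB folder now reads without the driver running out of memory
large_tbl <- spark_read_csv(sc, name = "large_data", path = "file:///data/large_csvs/")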
jay

I had the same issue as you and had no luck adjusting the local memory settings for my mavenized Java application (local[*]). I tried a lot of combinations (spark-env.sh, spark-defaults.conf, etc.).

Therefore I did the following workaround:

1) Add the desired memory size parameters to /opt/spark/conf/spark-defaults.conf:

spark.driver.memory     4g
spark.executor.memory   2g

2) Build a jar (mvn package in my case)

3) Submit the application from the command line via spark-submit:

spark-submit --repositories https://mvnrepository.com --packages graphframes:graphframes:0.5.0-spark2.1-s_2.10 --class com.mypackage.myApp --verbose --master local[*] ./target/com.mypackage.myApp-1.0.jar 

And voila, no more Java "out of memory" issues :-) The Spark UI also now shows the correct value in the Executors tab.

Aydin K.
  • thank you for the suggested workaround. If I submit the Spark application directly using spark-submit, then how do I 'connect' sparklyr to Spark? Ideally I would follow the workflow described in the sparklyr documentation, i.e. sending sparklyr code to Spark via RStudio. – jay Jun 21 '17 at 21:57
  • To be honest I've not worked with RStudio and sparklyr, but did you also try the --sparklyr.shell.executor-memory 5G parameter (instead of the driver-related one)? – Aydin K. Jun 22 '17 at 07:52
  • Also have a look at this solution: https://stackoverflow.com/questions/41384336/running-out-of-heap-space-in-sparklyr-but-have-plenty-of-memory – Aydin K. Jun 22 '17 at 07:54