
I am getting a java.lang.OutOfMemoryError when pulling data from a sparklyr table. I am running the code on the university computer cluster, so there should be plenty of spare memory to pull a single variable from my 1.48 GB database (the same error occurs when I collect the entire table with collect()). I have already tried many different Spark configurations, as described in https://github.com/rstudio/sparklyr/issues/379 and in "Running out of heap space in sparklyr, but have plenty of memory", but the problem still persists.
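For reference, the kind of configuration adjustments I have been trying look roughly like this (these mirror the settings visible in the full output below; the exact sizes are simply the last values I tried):

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "10G"    # heap for the driver JVM
config$`sparklyr.shell.executor-memory` <- "10G"  # heap for the executors
config$`spark.driver.maxResultSize` <- "10g"      # cap on results collected back to the driver
sc <- spark_connect(master = "local", config = config)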

Also, when I run java -version in a terminal while connected to the cluster, I get

java version "1.7.0_141" OpenJDK Runtime Environment (rhel-2.6.10.1.el6_9-x86_64 u141-b02) OpenJDK 64-Bit Server VM (build 24.141-b02, mixed mode)

so I don't think the problem is with the Java installation itself, as suggested in "How do I configure driver memory when running Spark in local mode via Sparklyr?"
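In case it is useful, one way to confirm how much heap the driver JVM actually received (a sketch I have not yet run on the cluster) is to query java.lang.Runtime from within the sparklyr session:

# Sketch: ask the driver JVM for its maximum heap, to check that
# sparklyr.shell.driver-memory was actually applied. `sc` is the
# connection returned by spark_connect() in the output below.
rt <- invoke_static(sc, "java.lang.Runtime", "getRuntime")
invoke(rt, "maxMemory") / 1024^3  # maximum heap in GiB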

Below is the output file:

R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

[Previously saved workspace restored]

> Sys.info()['nodename']
nodename 
"econ14" 
> 
> #memory.limit(size=10000)
> 
> #options(java.parameters = "-Xmx8048m")
> 
> 
> rm(list = ls()) #clear database
> library("sparklyr",lib.loc="/econ_s/saraiva/R_libs")
> library(dplyr,lib.loc="/econ_s/saraiva/R_libs")

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> library("config",lib.loc="/econ_s/saraiva/R_libs")#

Attaching package: ‘config’

The following objects are masked from ‘package:base’:

    get, merge

> library("rappdirs",lib.loc="/econ_s/saraiva/R_libs")#
> library("withr",lib.loc="/econ_s/saraiva/R_libs")#
> library("bindrcpp",lib.loc="/econ_s/saraiva/R_libs")#
> 
> 
> #Sys.setenv("SPARK_MEM" = "20g")
> config <- spark_config()
> #config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"
> config$`sparklyr.shell.driver-memory` <- "10G"
> config$`sparklyr.shell.executor-memory` <- "10G"
> config$`spark.driver.maxResultSize` <- "10g"
> config$`spark.yarn.executor.memoryOverhead` <- "16g"
> 
> 
> 
> sc<-spark_connect(master = "local",config = config)
* Using Spark: 2.1.0
> 
> 
> 
> 
> test=spark_read_json(sc = sc, name = "videos", path = "file/path.json")
> 
> #=====Select a subset of variables:=====
> a<-select(test, asin, helpful,overall)#works
> 
> #=====Filter Variables:=================
> a<- filter(test, asin=='B000H0X79O')#works
> #Using a function not defined in dplyr: (causes computer to run out of memory)
> tr<-select(test,  reviewText)
> tr<-pull(tr)
Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
    at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
    at scala.collection.mutable.ArrayOps$ofRef.addString(ArrayOps.scala:186)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:323)
    at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:186)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:325)
    at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:186)
    at sparklyr.Utils$.collectImplString(utils.scala:136)
    at sparklyr.Utils$.collectImpl(utils.scala:174)
    at sparklyr.Utils$$anonfun$collect$1.apply(utils.scala:198)
    at sparklyr.Utils$$anonfun$collect$1.apply(utils.scala:198)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at sparklyr.Utils$.collect(utils.scala:198)
    at sparklyr.Utils.collect(utils.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sparklyr.Invoke$.invoke(invoke.scala:102)
    at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
Execution halted
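
If it matters, my next idea is to try bringing the column over in smaller pieces rather than pulling it all at once, something along these lines (untested so far):

# Untested idea: limit the number of rows before collecting, to see
# whether reviewText can be transferred to R in smaller chunks.
tr_head <- test %>%
  select(reviewText) %>%
  head(1000) %>%
  collect()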
  • One frequent reason for an OutOfMemoryError is infinite recursion. Without your code I cannot tell whether it could happen here. – Ole V.V. Aug 31 '17 at 14:50
  • Is the code that I posted, with input/output, not enough? – AngryR11 Aug 31 '17 at 15:59
  • Sorry, I don’t think I saw all of it. Your stack trace shows recursion, but it isn't very deep, so I wouldn't think that this is the problem. I haven't got any better suggestions, though. – Ole V.V. Aug 31 '17 at 18:03
  • Best, of course, if you could provide enough information that someone could reproduce your problem, but I understand that this probably is not realistic. – Ole V.V. Aug 31 '17 at 18:04
  • Have a look at my comment here... this solved most of the Java out-of-memory errors for me: https://stackoverflow.com/questions/45234844/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/55243232#55243232 – drmariod Mar 19 '19 at 14:30
