I'm able to successfully generate a CSV file with 60 million records from an Oracle DB using the PySpark code below, connecting through JDBC.
Now I want the output in JSON format, so I added this line of code: df1.toPandas().to_json("/home/user1/empdata.json", orient='records'),
but I'm getting an OutOfMemoryError while generating the JSON.
Can anyone please recommend what code changes are required?
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Emp data Extract") \
    .config("spark.some.config.option", " ") \
    .getOrCreate()

def generateData():
    try:
        jdbcUrl = "jdbc:oracle:thin:USER/pwd@//hostname:1521/dbname"
        jdbcDriver = "oracle.jdbc.driver.OracleDriver"
        # Read the EMP table over JDBC; fetchSize controls rows fetched per round trip
        df1 = spark.read.format('jdbc').options(url=jdbcUrl, dbtable="(SELECT * FROM EMP) alias1", driver=jdbcDriver, fetchSize="2000").load()
        # This CSV write works fine:
        #df1.coalesce(1).write.format("csv").option("header", "true").save("/home/user1/empdata")
        # This JSON conversion fails with OutOfMemoryError:
        df1.toPandas().to_json("/home/user1/empdata.json", orient='records')
    except Exception as err:
        print(err)
        raise
    # finally:
    #     conn.close()

if __name__ == '__main__':
    generateData()
Error Log:
2019-04-15 05:17:06 WARN Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
[Stage 0:> (0 + 1) / 1]2019-04-15 05:20:22 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-04-15 05:20:22 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 0,5,main]
java.lang.OutOfMemoryError: Java heap space
As requested by the admin, I'm updating my comments: this is a somewhat different question. There are other OutOfMemory questions, but they occur in different scenarios. The error may be the same, but the problem is different: in my case I'm getting it because of the huge volume of data.
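For reference, here is a minimal sketch of the alternative I'm considering: letting Spark write the JSON in a distributed way instead of collecting everything to the driver with toPandas(). This assumes df1 is the same JDBC DataFrame as above, and that a directory of newline-delimited JSON part files is acceptable instead of a single /home/user1/empdata.json file:

# Write JSON directly from the executors instead of collecting to the driver.
# Assumption: an output directory of part files (one JSON object per line)
# is acceptable instead of a single empdata.json file.
df1.write \
    .mode("overwrite") \
    .json("/home/user1/empdata_json")

# If a single output file is really needed, coalesce(1) routes all rows
# through one task; it may be slow but still avoids the driver-side pandas
# conversion:
# df1.coalesce(1).write.mode("overwrite").json("/home/user1/empdata_json")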