
I'm able to successfully generate a CSV file with 60 million records from an Oracle DB using the PySpark code below, connecting through JDBC.

Now I want the output in JSON format, so I added this line of code: df1.toPandas().to_json("/home/user1/empdata.json", orient='records'), but I'm getting an OutOfMemoryError while generating the JSON.

Can anyone please recommend whether any code changes are required?

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Emp data Extract") \
    .config("spark.some.config.option", " ") \
    .getOrCreate()

def generateData():
    try:
        jdbcUrl = "jdbc:oracle:thin:USER/pwd@//hostname:1521/dbname"
        jdbcDriver = "oracle.jdbc.driver.OracleDriver"
        # Read the EMP table from Oracle over JDBC
        df1 = spark.read.format('jdbc').options(url=jdbcUrl, dbtable="(SELECT * FROM EMP) alias1", driver=jdbcDriver, fetchSize="2000").load()
        # The CSV export below works for the full 60 million rows:
        #df1.coalesce(1).write.format("csv").option("header", "true").save("/home/user1/empdata" , index=False)
        # The JSON export below fails with java.lang.OutOfMemoryError:
        df1.toPandas().to_json("/home/user1/empdata.json", orient='records')
    except Exception as err:
        print(err)
        raise
    # finally:
    # conn.close()

if __name__ == '__main__':
    generateData()

Error Log:

2019-04-15 05:17:06 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
[Stage 0:>                                                          (0 + 1) / 1]2019-04-15 05:20:22 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
        at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-04-15 05:20:22 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 0,5,main]
java.lang.OutOfMemoryError: Java heap space

As requested by the admin, I'm updating my comments: this is a somewhat different question. There are other OutOfMemory questions, but they arise in different scenarios. The error may be the same, but the problem is different; in my case it occurs because of the huge volume of data.

RK.
  • The file is too large to be handled in heap memory. One way to handle this is to go with buffered I/O streams. – tush4r Apr 17 '19 at 08:39
  • OK, thanks for your response. But at the same time I'm able to generate a CSV with 60 million records; you can see the commented line for CSV. The only issue now is with JSON. – RK. Apr 17 '19 at 08:42
  • Sorry for my limited knowledge of Spark; I proposed what's possible in general terms. I think what @Arnon Rotem-Gal-Oz mentioned should do the trick for you. – tush4r Apr 17 '19 at 08:45

1 Answer


If you want to save as JSON, you should use Spark's write command. What you currently do is bring all the data to the driver and try to load it into a pandas DataFrame:

df1.write.format('json').save('/path/file_name.json')

If you need a single file, you can try:

df1.coalesce(1).write.format('json').save('/path/file_name.json')
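
As a usage note: save() writes a directory of part files rather than a single .json file, and each part file is in JSON Lines format, i.e. one JSON object per line with no enclosing [ ] array brackets. A minimal sketch, assuming the same df1 from the question and an illustrative output path:

# Minimal sketch, assuming the same df1 as above; the output path is illustrative.
# save() produces a directory containing part-*.json files in JSON Lines format
# (one JSON object per line, with no enclosing [ ] array brackets).
(df1.coalesce(1)
    .write
    .mode("overwrite")                  # replace the target directory if it already exists
    .format("json")
    .save("/home/user1/empdata_json"))  # a directory, not a single file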
Arnon Rotem-Gal-Oz
  • Yes, now I'm getting output, but the JSON file is missing the braces [ ], and it's 20 GB in size, so I can't add them manually. Is there any option to include these as well? – RK. Apr 17 '19 at 10:49
  • I mean, the opening and closing brackets of a JSON array. – RK. Apr 17 '19 at 11:51
  • I don't think you can - but if you need to read it in pandas later you can with pd.read_json("filename.json", lines=True) – Arnon Rotem-Gal-Oz Apr 17 '19 at 12:06
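
A small sketch of reading that JSON Lines output back with pandas, as suggested in the comment above; the part-file name is a placeholder for whatever name Spark generates, and process() is a hypothetical downstream function:

import pandas as pd

# Placeholder path: Spark generates the actual part-file name inside the output directory.
path = "/home/user1/empdata_json/part-00000-<generated-name>.json"

# lines=True parses one JSON object per line, so the missing [ ] array brackets
# are not a problem. For a ~20 GB file, chunksize (which requires lines=True)
# yields DataFrames incrementally instead of loading everything at once.
for chunk in pd.read_json(path, lines=True, chunksize=100000):
    process(chunk)  # hypothetical downstream handling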