Is there a way, other than repartition() (which slows down the processing), to combine all the 1 MB output files into a smaller number of big files?

I run Spark code on 500 GB of data, on 100 executors with 24 cores each, and I want the output saved as files of roughly 128 MB each. Right now each output file is about 1 MB. These are the settings I have already tried:
spark.sql("set pyspark.hadoop.hive.exec.dynamic.partition=true")
spark.sql("set pyspark.hadoop.hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.merge.tezfiles=true")
spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize=128000000")
spark.sql("set hive.merge.size.per.task=128000000")