
I have a partitioned Hive table stored as Parquet. Each partition of the table contains exactly one Parquet file.

Using Spark, I want to read from that table, do a map join, and write to another table with the same partitioning. All of this should be possible without a shuffle. My problem is that I get multiple files in each partition of the target table. I think this is because Spark breaks the input files into multiple splits, and each split then results in a separate file in the target table.

I have tried setting `mapreduce.input.fileinputformat.split.minsize` larger than the maximum file size, but it had no effect. Is this the right config? Where exactly do I need to set it in Spark?
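For reference, here is a minimal sketch of where such a Hadoop input property can be set in Spark (the app name and the 1 GB value are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: prefix any Hadoop property with "spark.hadoop." when building the session
val spark = SparkSession.builder()
  .appName("split-size-demo")
  .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
    (1024L * 1024 * 1024).toString)
  .enableHiveSupport()
  .getOrCreate()

// Option 2: set it on the underlying Hadoop configuration at runtime
spark.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.minsize", 1024L * 1024 * 1024)
```

Note, though, that Spark SQL's native Parquet reader does not consult the MapReduce split settings; it sizes its input partitions via `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, which could explain why the setting had no effect.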

UPDATE:
OK, to hopefully make it clearer what I want...
I have two Hive tables, A and B. Both are partitioned by the same column; I'm talking about good old Hive partitions here. In table A, each partition has exactly one file. Table B is empty.

Using Spark, I want to load data from multiple Hive partitions of table A, do a map join with a small table C, and load the result into table B.
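For concreteness, a minimal sketch of that pipeline in Scala (assuming Hive support is enabled; `join_key` is a placeholder for the real join column). `broadcast(c)` hints Spark to use a broadcast hash join, i.e. a map join, so the rows of A are never shuffled:

```scala
import org.apache.spark.sql.functions.broadcast

val a = spark.table("A")  // large, Hive-partitioned source
val c = spark.table("C")  // small table, fits in memory

// broadcast hash join: C is shipped to every task, A stays where it is
val joined = a.join(broadcast(c), Seq("join_key"))

// write into B's Hive partitions; dynamic-partition settings
// (e.g. hive.exec.dynamic.partition.mode=nonstrict) may be needed
joined.write.insertInto("B")
```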

Currently, I am getting multiple files in each partition of table B. This happens because when Spark reads the files from table A, each file is split into multiple DataFrame/RDD partitions. Each of these partitions is then processed in a separate task, and each task produces one output file.

What I normally do is repartition the DataFrame by the partition columns of table B. This gives me one file for each partition. The downside is that it requires a shuffle.
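That workaround might look like this, reusing `joined` from the sketch above (`part_col` stands in for B's partition column):

```scala
import org.apache.spark.sql.functions.col

// shuffle so that all rows with the same part_col value land in one task,
// which then writes exactly one file into that Hive partition of B
joined
  .repartition(col("part_col"))
  .write
  .insertInto("B")
```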

To avoid this, I want to know: is there a way to have exactly one Spark task read and process each file, rather than splitting it into multiple RDD partitions/tasks?
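One knob worth testing here (my assumption, not something confirmed in the question): for Spark SQL's native Parquet reader, the split size is governed by `spark.sql.files.maxPartitionBytes` rather than the MapReduce properties. Raising it above the largest file should keep each file in a single task, and since every file belongs to exactly one Hive partition, that would mean one output file per partition without a shuffle:

```scala
// illustrative value: anything larger than the biggest Parquet file in A;
// with this cap, the file scan should not split a single file across tasks
spark.conf.set("spark.sql.files.maxPartitionBytes",
  (2L * 1024 * 1024 * 1024).toString)
```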

Joha
  • use coalesce to decrease the number of partitions just before writing your DataFrame – firsni Oct 09 '19 at 11:56
  • The problem with coalesce is that I need to set a number of partitions, and I don't know that number. – Joha Oct 09 '19 at 12:05
  • If you put 1, you get one file as output. You choose the number of files you want. – firsni Oct 09 '19 at 14:18
  • What I actually want is to repartition by the partition column(s). This would give me one file per partition in the target table. But this results in a shuffle that is actually not necessary, and since my source table is very large, the shuffle takes a lot of time. – Joha Oct 09 '19 at 16:09
  • _"I don't know the number [iof partitions to set]"_ -- but you know how many executors are running, right? >> If you have 3 Spark executors and do not need a repartition-by-key, then `coalesce(3)` shoud not trigger any kind of shuffle, just a **local** merging... to be tested, of course >> https://datanoon.com/blog/spark_repartition_coalesce/#4-coalesce >> https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/use-coalesce-to-repartition-in-decrease-number-of-partition.html – Samson Scharfrichter Oct 09 '19 at 20:28
  • >> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html#coalesce – Samson Scharfrichter Oct 09 '19 at 20:28
  • Also, recommended reading about the mysterious case of "OMG I always get exactly 200 partitions whatever I try" >> https://stackoverflow.com/questions/45704156/what-is-the-difference-between-spark-sql-shuffle-partitions-and-spark-default-pa/45704560 – Samson Scharfrichter Oct 09 '19 at 20:34
  • Note that there is no "split" in Spark. The term is "partition". Which is _not_ the same as Hive partitions... Really not. It's possible I don't understand your question, and that you are confused by my comments and by the links I added. – Samson Scharfrichter Oct 09 '19 at 20:36
  • Please see my edits – Joha Oct 11 '19 at 10:31
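To illustrate the `coalesce` suggestion from the comments (a sketch; the count 3 matches the three-executor example above, and `joined` is the DataFrame from the earlier sketch):

```scala
// coalesce merges existing partitions through a narrow dependency:
// local merging only, no shuffle
joined
  .coalesce(3)
  .write
  .insertInto("B")
```

Note that `coalesce` fixes the total number of write tasks, not the number of files per Hive partition, which is why it does not quite match what the asker wants.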

0 Answers