I have a partitioned Hive table in Parquet format. Each partition of the table has exactly one Parquet file in it.
Using Spark I want to read from that table, do a mapjoin and write to another table (with the same partitioning). This is all possible without a shuffle. My problem is that in the target table I get multiple files in each partition. I think this is because Spark breaks the input files into multiple splits, and each split then results in a file in the target table.
I have tried setting mapreduce.input.fileinputformat.split.minsize
to a value larger than the maximum file size, but it did not have any effect. Is this the right config? Where exactly do I need to set it in Spark?
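For reference, this is roughly where I have been setting it (just a sketch of my attempt, not something I know to be correct; the 1 TB value is arbitrary and only meant to exceed any single file):

```scala
// Either on spark-submit, prefixing the Hadoop property with "spark.hadoop.":
//   spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1099511627776 ...
// or programmatically on the Hadoop configuration before reading:
spark.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.minsize", 1L << 40) // ~1 TB, larger than any file
```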
UPDATE:
OK, to hopefully make it clearer what I want...
I have two Hive tables A and B. Both are partitioned by the same column. I'm talking about good old Hive partitions here. In table A each partition has exactly one file. Table B is empty.
Using Spark, I want to load data from multiple Hive partitions of table A, do a mapjoin with a small table C and load the result into table B.
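Here is a sketch of what I am doing. The partition column name (part_col), the join key (key), and the partition values are made up for illustration; A, B and C are the tables described above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder()
  .appName("A-join-C-into-B")
  .enableHiveSupport()
  .getOrCreate()

// Read a few Hive partitions of A and the small table C.
val a = spark.table("A").where(col("part_col").isin("2017-01-01", "2017-01-02"))
val c = spark.table("C")

// Mapjoin: C is broadcast to every task, so no shuffle is needed for the join.
val joined = a.join(broadcast(c), Seq("key"))

// Write into the already-partitioned table B (dynamic partition settings
// may be required depending on the Hive/Spark configuration).
joined.write.mode("append").insertInto("B")
```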
Currently I am getting multiple files in each partition of table B. This happens because, when Spark reads the files from table A, each file is split into multiple DataFrame/RDD partitions in Spark. Each of these partitions is then processed in a separate task, and each task produces one output file.
What I normally do is repartition the DataFrame by the partition columns of table B. This gives me one file for each partition, but the downside is that it requires a shuffle.
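Continuing the sketch above, this is the workaround (again with the hypothetical part_col name):

```scala
// Collapses each Hive partition to a single output file, at the cost of a shuffle.
joined
  .repartition(col("part_col"))
  .write.mode("append")
  .insertInto("B")
```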
To avoid this, I want to know whether there is a way to have exactly one Spark task read and process each file, instead of splitting it into multiple RDD partitions/tasks.