4

I am working on a project and I am stuck on the following scenario.

I have a table: superMerge(id, name, salary)

and I have 2 other tables: table1 and table2

All the tables (table1, table2 and superMerge) have the same structure.

Now, my challenge is to insert/update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t=20 minutes I have two jobs trying to update the same table (superMerge in this case).

I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.

GKV
  • 805
  • 8
  • 16
  • Can you describe more about the tables? What are you trying to do with superMerge? What's wrong with constantly appending table1 and table2 to the same location, then reading superMerge in from that location when necessary? – ayplam Oct 31 '17 at 00:09
  • superMerge could be a view of tables 1 and 2 instead of a normal table; that would avoid the need to update superMerge. – Arthur Julião Nov 01 '17 at 19:14
  • 1
    @ArthurJulião creating a view is an option and I will actually take that route if there is no direct solution, but I am looking for a solution that Spark or Phoenix might provide which I am not aware of – GKV Nov 03 '17 at 09:54

4 Answers

5

The problem here is that the two jobs can't communicate with each other; neither knows what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:

  • Each job creates an (empty) file in a specific folder on HDFS indicating that its update/insert is in progress, and removes that file once the job is done

  • Now, each job has to check whether such a file exists before starting its update/insert. If it exists, the job must wait until the file is gone. A sketch of this follows below.
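
A minimal sketch of that locking idea in Scala with the Hadoop FileSystem API (the lock path and the 5-second polling interval are arbitrary choices for illustration, not part of the answer):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val lockPath = new Path("/locks/superMerge.lock") // hypothetical lock location

// createNewFile is atomic: it returns false if the file already exists,
// so only one job can acquire the lock at a time.
while (!fs.createNewFile(lockPath)) {
  Thread.sleep(5000) // another job is writing; poll until the lock is released
}
try {
  // ... perform the insert/update on superMerge here ...
} finally {
  fs.delete(lockPath, false) // release the lock even if the merge fails
}

Note that a hard JVM crash can still leave a stale lock file behind, so in practice you may also want to expire locks older than some threshold.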

Raphael Roth
  • 23,651
  • 9
  • 69
  • 117
1

Can you control the code of job1 & job2? How do you schedule them?

In general, you can combine those two jobs into one that runs every 10 minutes. Once every 20 minutes this unified job runs in a different mode (merging from 2 tables), while the default mode is to merge from 1 table only. With a single driver you don't need any synchronisation between the two jobs (e.g. locking). This solution assumes that the jobs finish in under 10 minutes.
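
A rough sketch of that single driver in Spark/Scala (the runNumber counter and the mergeIntoSuperMerge helper are assumptions for illustration, standing in for however you schedule and merge today):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder for your existing insert/update logic against superMerge.
def mergeIntoSuperMerge(updates: DataFrame): Unit = ???

// Scheduled every 10 minutes (cron/Oozie); every second run is the two-table mode.
def runJob(runNumber: Long, spark: SparkSession): Unit = {
  val source =
    if (runNumber % 2 == 0)
      spark.table("table1").union(spark.table("table2")) // t = 20, 40, ... minutes
    else
      spark.table("table1")                              // default: table1 only
  mergeIntoSuperMerge(source)
}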

Igor Berman
  • 1,482
  • 10
  • 15
1

How large are your datasets? Are you planning to do it in batch (Spark), or could you stream your inserts/updates (Spark Streaming)?

Let's assume you want to do it in batch:

  • Launch only one job every 10 minutes that can process the two tables: if you get Table 1 and Table 2, do a union and join the result with superMerge, as Igor Berman suggested (see the sketch below).
  • Be careful: as your superMerge table gets bigger, your join will take longer.
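
One hedged sketch of that union-then-join upsert, assuming the question's (id, name, salary) schema, unique ids across the two update tables, and an "incoming record wins" rule:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val updates = spark.table("table1").union(spark.table("table2"))
val current = spark.table("superMerge")

// Keep every incoming row, plus the existing rows whose id was not updated;
// the left_anti join drops current rows that are superseded by an update.
val merged = updates.union(current.join(updates, Seq("id"), "left_anti"))

// Spark can't overwrite a table it is reading from, so write to a staging
// table (or path) and swap it in afterwards.
merged.write.mode("overwrite").saveAsTable("superMerge_staging")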
nicou
  • 36
  • 4
0

I faced this situation. Write the tb1 DF1 to location1 and the tb2 DF2 to location2, and at the end just switch the paths into the super merge table. You can also do a table-to-table insert, but that consumes a lot of runtime, especially in Hive.

Overwriting to the staging locations location1 and location2:

df1.write.mode("overwrite").partitionBy("partition").parquet(location1)

df2.write.mode("overwrite").partitionBy("partition").parquet(location2)

Switching paths to the super merge table:

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")

This way you can do the parallel merging without one job overwriting the other.

roh
  • 963
  • 1
  • 10
  • 17