Hive join query returning Cartesian product on inner join

Question

I am doing inner join on two tables that are created using Hive. One is a big table "trades_bucket" and another is a small table "counterparty_bucket". They are created as follows :-

DROP TABLE IF EXISTS trades_bucket;
CREATE EXTERNAL TABLE trades_bucket(
parentId STRING,
BookId STRING) CLUSTERED BY(parentId) SORTED BY(parentId) INTO 32 BUCKETS;

DROP TABLE IF EXISTS counterparty_bucket;
CREATE EXTERNAL TABLE counterparty_bucket(
Version STRING,AccountId STRING,childId STRING) 
CLUSTERED BY(childId ) SORTED BY(childId) INTO 32 BUCKETS;

The Join between the tables

SELECT /*+ MAPJOIN(counterparty_bucket) */ BookId , t.counterpartysdsid, c.sds  
FROM counterparty_bucket c join trades_bucket t 
on c.childId = t.parentId
where c.childId ='10001684'

The problem is that the join is producing Cartesian product out of the two tables. What I mean is if big table has 100 rows and small table has 4 rows for a given id, I expect the join to return 100 rows, but I am getting back 400 rows. Anyone have a clue or anyone witnessed similar situation?

Is your syntax right? Shouldn't you have `MAPJOIN(c)` instead of `MAPJOIN(counterparty)`? — Santiago Cepas, May 29 '14 at 07:33
@SantiagoCepas,It will work irrespective of whether you use alias or the exact table name, Even if I remove the entire MAPJOIN hint, the join always ends up with a Cartesian product. — rajibdotnet, May 29 '14 at 14:23
But, according to the table definition in your code, the exact table name is `counterparty_bucket` not `counterparty` — Santiago Cepas, May 29 '14 at 15:21

Hive join query returning Cartesian product on inner join

0 Answers0