0

I am doing inner join on two tables that are created using Hive. One is a big table "trades_bucket" and another is a small table "counterparty_bucket". They are created as follows :-

DROP TABLE IF EXISTS trades_bucket;
CREATE EXTERNAL TABLE trades_bucket(
parentId STRING,
BookId STRING) CLUSTERED BY(parentId) SORTED BY(parentId) INTO 32 BUCKETS;

DROP TABLE IF EXISTS counterparty_bucket;
CREATE EXTERNAL TABLE counterparty_bucket(
Version STRING,AccountId STRING,childId STRING) 
CLUSTERED BY(childId ) SORTED BY(childId) INTO 32 BUCKETS;

The Join between the tables

SELECT /*+ MAPJOIN(counterparty_bucket) */ BookId , t.counterpartysdsid, c.sds  
FROM counterparty_bucket c join trades_bucket t 
on c.childId = t.parentId
where c.childId ='10001684'

The problem is that the join is producing Cartesian product out of the two tables. What I mean is if big table has 100 rows and small table has 4 rows for a given id, I expect the join to return 100 rows, but I am getting back 400 rows. Anyone have a clue or anyone witnessed similar situation?

rajibdotnet
  • 1,296
  • 2
  • 15
  • 27

0 Answers0