I would like to know the major difference between CLUSTER BY and CLUSTERED BY in Hive.

My understanding is that CLUSTER BY is used for bucketing the table, and that it uses a hash function.

CLUSTERED BY is used for ordering by value within the reducer.

Is there any other difference between them?

Please let me know

Thanks

venkatbala.

Venkadesh Venkat

2 Answers

"CLUSTERED BY" only distributes your keys into different buckets. "CLUSTER BY" ensures that each of the N reducers gets a non-overlapping set of keys, and then sorts the rows by those keys within each reducer. The major difference is the sorting.
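
As a rough illustration of that distinction, here is a minimal HiveQL sketch. The table and column names (page_views, page_views_bucketed, user_id, url, view_time) are hypothetical and chosen only for this example; the bucket count is arbitrary.

```sql
-- Hypothetical tables and columns, for illustration only.

-- DDL: CLUSTERED BY declares bucketing. Rows are assigned to one of
-- 4 buckets by hashing user_id. This clause on its own does not ask
-- for any ordering inside a bucket.
CREATE TABLE page_views_bucketed (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- Query: CLUSTER BY sends all rows with the same user_id to the same
-- reducer and sorts each reducer's rows by user_id.
SELECT user_id, url
FROM page_views
CLUSTER BY user_id;

-- CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same
-- column, so this query should behave the same way.
SELECT user_id, url
FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id;
```

Note that the output of CLUSTER BY is sorted within each reducer only; a total order over the whole result would need ORDER BY.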

Mobin Ranjbar

In DDL (CREATE statements), the past-participle form is used: PARTITIONED BY, CLUSTERED BY, SORTED BY.

In DML (such as SELECT statements), the present form is used: PARTITION BY, CLUSTER BY, DISTRIBUTE BY, SORT BY.

This is the only difference. Don't mix sorting or bucketing complexities into it.
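
To make the two keyword families concrete, here is a small HiveQL sketch that puts the CREATE-statement forms and the SELECT-statement forms side by side. All table and column names (sales_bucketed, sales_raw, customer_id, order_id, amount, sale_date) are hypothetical, and the bucket count is arbitrary.

```sql
-- Hypothetical tables and columns, for illustration only.

-- CREATE statement: the "-ED BY" forms describe how the table is laid out.
CREATE TABLE sales_bucketed (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) SORTED BY (order_id ASC) INTO 8 BUCKETS
STORED AS ORC;

-- SELECT statement: DISTRIBUTE BY and SORT BY control how rows are
-- routed to reducers and ordered within each reducer for this query.
SELECT customer_id, order_id, amount
FROM sales_raw
DISTRIBUTE BY customer_id
SORT BY customer_id, order_id;

-- SELECT statement: PARTITION BY appears inside a window specification.
SELECT customer_id,
       order_id,
       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM sales_raw;
```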

To understand the difference between CLUSTER BY, DISTRIBUTE BY, and SORT BY, refer to this question: Hive cluster by vs order by vs sort by

Nikhil