When I run a Hive query like this:

 select req_time from ncsa where req_time > 90 sort by req_time limit 100;

I get this output:

958
952
951 
97
96
96
959
957
956
955 
955
953
95
94
92

I guess that in the map phase the data is divided into several parts, and each reducer sorts only its own part.

How can I solve this problem?

caidao

2 Answers

Use ORDER BY instead of SORT BY.

The difference between ORDER BY and SORT BY is that the former guarantees total order in the output, while the latter only guarantees ordering of the rows within each reducer. See the Hive docs for more details.

P.S. Make sure req_time is a numeric field; if it is stored as a string, the values are compared lexicographically rather than numerically.
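
As a minimal sketch of that change, here is the query from the question rewritten with ORDER BY. The second statement assumes req_time is stored as a STRING, which the question does not confirm; the alias req_time_num is a name made up for illustration.

```sql
-- Total order: Hive funnels the final sort through a single reducer,
-- so the 100 rows returned are the 100 smallest values overall.
SELECT req_time
FROM ncsa
WHERE req_time > 90
ORDER BY req_time
LIMIT 100;

-- Assumption: req_time is a STRING column. Cast it, alias the result
-- (req_time_num is a hypothetical name), and order by the alias so the
-- comparison is numeric instead of lexicographic.
SELECT CAST(req_time AS INT) AS req_time_num
FROM ncsa
WHERE CAST(req_time AS INT) > 90
ORDER BY req_time_num
LIMIT 100;
```

Append DESC to the ORDER BY column if you want the largest values first.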

pensz

I'll quote the answer from "Hive cluster by vs order by vs sort by":

  • CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.

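  • So CLUSTER BY is basically the more scalable version of ORDER BY.

One caveat: by default Hive assigns rows to reducers by hashing x, so each output file is sorted and no value appears in two files, but the files are not guaranteed to cover consecutive ranges. If you need a strict total order, use ORDER BY.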
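
To make the equivalence mentioned in the quote concrete, here is a sketch of both forms applied to the table and column from the question. LIMIT is left out on purpose so the statements only show how rows are distributed and sorted.

```sql
-- Shorthand form: distribute rows by req_time, then sort within each reducer.
SELECT req_time
FROM ncsa
WHERE req_time > 90
CLUSTER BY req_time;

-- Equivalent long form: the same distribution and the same per-reducer sort.
SELECT req_time
FROM ncsa
WHERE req_time > 90
DISTRIBUTE BY req_time
SORT BY req_time;
```

Either statement produces one sorted output per reducer, and all rows with the same req_time end up in the same reducer.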

Bohdan