Using Hive global sorting

Question

When using Hive like this:

 select req_time from ncsa where req_time > 90 sort by req_time limit 100;

you will find that the result is not globally sorted.

I guess that in the map phase the data is divided into several parts, and each reducer sorts only its own part.

Please tell me how to solve this problem.

I'm sorry, it should be "select req_time from ncsa where req_time > 90 sort by req_time limit 100;" and the result is not globally sorted. — caidao, Feb 20 '13 at 09:57

score 1 · Accepted Answer · answered Feb 23 '13 at 16:37

Use ORDER BY instead of SORT BY.

The difference between ORDER BY and SORT BY is that the former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer. See the Hive docs for more details.

PS. Make sure req_time is a numeric field.
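As a minimal sketch of the fix, the query from the question can be rewritten with ORDER BY. The second statement is only relevant under the assumption that req_time is stored as a STRING; the alias req_time_num is a hypothetical name introduced here for illustration.

```sql
-- Total order across the whole result: ORDER BY instead of SORT BY.
SELECT req_time
FROM ncsa
WHERE req_time > 90
ORDER BY req_time
LIMIT 100;

-- Assumption: req_time is a STRING column. Cast it so that the filter
-- and the ordering compare numbers rather than text.
SELECT CAST(req_time AS DOUBLE) AS req_time_num
FROM ncsa
WHERE CAST(req_time AS DOUBLE) > 90
ORDER BY req_time_num
LIMIT 100;
```

Keeping the LIMIT clause also matters if Hive runs in strict mode, where an ORDER BY without LIMIT is rejected.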

score 0 · Answer 2 · edited May 23 '17 at 12:29

I'll quote an answer from "Hive cluster by vs order by vs sort by":

CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
So CLUSTER BY is basically the more scalable version of ORDER BY.
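The equivalence mentioned in the quote can be sketched with the table from the question. Both statements below should behave the same way. Since SORT BY only orders rows within each reducer, it is worth inspecting the output files before relying on a total order across them.

```sql
-- Shorthand form.
SELECT req_time
FROM ncsa
WHERE req_time > 90
CLUSTER BY req_time;

-- Long form: send rows with the same req_time to the same reducer,
-- then sort within each reducer.
SELECT req_time
FROM ncsa
WHERE req_time > 90
DISTRIBUTE BY req_time
SORT BY req_time;
```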
