It seems you're looking for a histogram.
If you are looking for performance, look into the CUB or Thrust libraries, as the two comments point out; otherwise you'll end up spending a lot of time and still not reaching those performance levels.
If you have decided to implement the histogram yourself, I'd recommend starting with the simplest implementation: a two-step approach. In the first step you count the number of elements that fall into each bucket, so you can create the container structure with the right array sizes. The second step simply copies the elements to the corresponding array of the structure.
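To make the two steps concrete, here is a minimal host-side C++ sketch. It is not a tuned GPU kernel, just the logic. The function name `bucketize_two_step`, the helper `bucket_of`, and the equal-width bucket rule for non-negative integers are my own assumptions for illustration; substitute whatever rule maps your elements to buckets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bucket rule (an assumption): equal-width buckets over
// non-negative integers, so value v lands in bucket v / bucket_width.
inline std::size_t bucket_of(int value, int bucket_width) {
    return static_cast<std::size_t>(value / bucket_width);
}

std::vector<std::vector<int>> bucketize_two_step(const std::vector<int>& input,
                                                 int bucket_width,
                                                 std::size_t num_buckets) {
    // Step 1: count how many elements fall into each bucket.
    std::vector<std::size_t> counts(num_buckets, 0);
    for (int v : input) {
        std::size_t b = bucket_of(v, bucket_width);
        assert(b < num_buckets);  // input must fit the assumed bucket range
        ++counts[b];
    }

    // Size every per-bucket array once, using the counts from step 1.
    std::vector<std::vector<int>> buckets(num_buckets);
    for (std::size_t b = 0; b < num_buckets; ++b) {
        buckets[b].reserve(counts[b]);
    }

    // Step 2: copy each element into the array of its bucket.
    for (int v : input) {
        buckets[bucket_of(v, bucket_width)].push_back(v);
    }
    return buckets;
}
```

For example, calling `bucketize_two_step({1, 12, 5, 25, 17, 3}, 10, 3)` groups the values as {1, 5, 3}, {12, 17} and {25}. On the GPU, the counting step would typically need atomic increments (or per-block partial counts), since many threads can hit the same bucket.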
From there, you can evolve to more complex versions, for example using a prefix sum to calculate the starting positions of the buckets in one large array.
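A rough sketch of the prefix-sum variant, again as plain host-side C++ under the same assumed bucket rule (equal-width buckets over non-negative integers). The names `FlatBuckets` and `bucketize_flat` are hypothetical. The exclusive prefix sum of the counts gives the offset at which each bucket starts inside one contiguous array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: bucket b occupies data[offsets[b] .. offsets[b + 1]).
struct FlatBuckets {
    std::vector<std::size_t> offsets;  // num_buckets + 1 entries
    std::vector<int> data;             // all elements, grouped by bucket
};

FlatBuckets bucketize_flat(const std::vector<int>& input,
                           int bucket_width,
                           std::size_t num_buckets) {
    FlatBuckets out;
    out.offsets.assign(num_buckets + 1, 0);

    // Count elements per bucket, stored one slot to the right.
    for (int v : input) {
        std::size_t b = static_cast<std::size_t>(v / bucket_width);
        assert(b < num_buckets);  // input must fit the assumed bucket range
        ++out.offsets[b + 1];
    }

    // Running sum over the shifted counts: offsets[b] becomes the exclusive
    // prefix sum of the counts, i.e. the start position of bucket b.
    for (std::size_t b = 0; b < num_buckets; ++b) {
        out.offsets[b + 1] += out.offsets[b];
    }

    // Scatter: each bucket has a write cursor that starts at its offset.
    std::vector<std::size_t> cursor(out.offsets.begin(), out.offsets.end() - 1);
    out.data.resize(input.size());
    for (int v : input) {
        std::size_t b = static_cast<std::size_t>(v / bucket_width);
        out.data[cursor[b]++] = v;
    }
    return out;
}
```

A single flat array plus an offsets table avoids one allocation per bucket and keeps each bucket contiguous in memory. In a GPU version the scan is the part I would hand to a library, and the cursor increment in the scatter step would become an atomic add.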
The application is bound by memory traffic (there is essentially no arithmetic workload), so try to improve the locality and the access patterns as much as you can.
Of course, check out the open-source code (CUB and Thrust are both open source) to get some ideas.