
So I've run into a problem in some C++ code I'm writing. I need to find the median of an array of points accessed with an offset and step size (example).
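To clarify what I mean by offset and step size: the median is taken over every step-th element starting at offset, roughly like this (the names here are purely illustrative, not my actual code):

```cpp
#include <cstddef>

// The median is taken over count elements spaced step apart, starting at
// offset, without touching anything else in the array:
//   data[offset], data[offset + step], data[offset + 2 * step], ...
inline float element_at(const float* data, std::size_t offset,
                        std::size_t step, std::size_t i) {
    return data[offset + i * step];   // i ranges over [0, count)
}
```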

This code will be executed millions of times, as it's part of one of my core data structures, so I'm trying to make it as fast as possible.

Research has led me to believe that for the best worst-case time complexity, introselect is the fastest way to find a median in a set of unordered values. I have some additional limitations that have to do with optimization:

  1. I can't swap any values in the array. Every value is exactly where it needs to be for the surrounding code, but I still need the median.

  2. I can't make any "new" allocations or call anything that allocates on the heap. Or if I have to, allocations need to be kept to a minimum, as they are costly.

I've tried implementing the following in C++: First, Second, Third.

Is this possible? Or are there alternatives that are just as fast at finding the median and fit those requirements?

  • Is this useful for you?: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-design-and-analysis-of-algorithms-spring-2012/lecture-notes/MIT6_046JS12_lec01.pdf – lorro Jun 23 '16 at 19:14
  • @lorro I've taken a look at that before and have the select from that implemented, but I can't figure out a way to perform the partition step without swaps or allocating space on the heap. – TheRobotCarlson Jun 23 '16 at 19:26
  • Are you actually doing a selection, or do you just need the median from an array of numbers? I don't see how you can select items without either rearranging the array to put the selected items at the front, or allocating some memory in which to return the values. – Jim Mischel Jun 23 '16 at 19:41
  • @JimMischel I just need the median from an unsorted array of numbers without changing the original array and with minimum allocations. If I have no alternative than to allocate space for a single result array, then I'd like to minimize the size of it as much as possible. – TheRobotCarlson Jun 23 '16 at 19:52
  • What's the data type of the keys? Do you know anything about their distribution? – Yves Daoust Jun 24 '16 at 14:38
  • @YvesDaoust They'll be floats (most of the time). Assume a random distribution. In one particular application of this function, the data will be an array with an implicit kd-tree. The addition of new points to the end of that contiguous piece of memory is what causes us to need the median along a dimension, for some region of points, when rebalancing. – TheRobotCarlson Jun 24 '16 at 16:45
  • If the distribution is uniform enough, it might be interesting to consider a bucketing approach, counting the numbers in regular subintervals of the domain in order to select a subset of numbers containing the median (see the sketch after these comments). – Yves Daoust Jun 24 '16 at 16:52
  • Regarding memory allocation, you can preallocate global or heap arrays of the maximum problem size, once for all. – Yves Daoust Jun 24 '16 at 16:53
  • @YvesDaoust If I understand you correctly, there isn't a maximum problem size (or at least not in theory; only machine limitations), so I need to be able to work with any amount of numbers. Allocating the space on the heap before the function call still causes the problem I'm trying to avoid (I'm trying to maintain good cache hit/miss ratios). – TheRobotCarlson Jun 24 '16 at 17:17
  • Preallocation won't influence the cache; why should it? You know your maximum problem size. – Yves Daoust Jun 24 '16 at 17:18
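A rough sketch of the bucketing idea suggested in the comments above, assuming n > 0 and finite float keys that are spread reasonably evenly; the bucket count, function name, and the rank-counting fallback are illustrative choices, not anything from the discussion:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Sketch only: find the (lower) median without swapping elements and without
// heap allocation. One counting pass narrows the search to a single bucket,
// then the exact rank is resolved by comparison counting, which is
// O(n * bucket_size) and therefore cheap only when the data is spread fairly
// evenly across buckets. Assumes n > 0 and finite (non-NaN) keys.
inline float median_by_bucketing(const float* data, std::size_t n) {
    const std::size_t k = n / 2;                       // rank of the lower median
    const float lo = *std::min_element(data, data + n);
    const float hi = *std::max_element(data, data + n);
    if (lo == hi) return lo;                           // all values are equal

    constexpr std::size_t kBuckets = 256;              // stack-sized histogram
    const double width = (static_cast<double>(hi) - lo) / kBuckets;
    auto bucket_of = [&](float v) {
        auto b = static_cast<std::size_t>((static_cast<double>(v) - lo) / width);
        return std::min(b, kBuckets - 1);
    };

    std::array<std::size_t, kBuckets> counts{};        // no heap allocation
    for (std::size_t i = 0; i < n; ++i)
        ++counts[bucket_of(data[i])];

    // Locate the bucket that must contain the k-th smallest value.
    std::size_t cum = 0, b = 0;
    while (cum + counts[b] <= k) cum += counts[b++];

    // Resolve the exact rank among the elements assigned to that bucket.
    for (std::size_t i = 0; i < n; ++i) {
        const float v = data[i];
        if (bucket_of(v) != b) continue;
        std::size_t less = 0, equal = 0;
        for (std::size_t j = 0; j < n; ++j) {
            less  += (data[j] <  v);
            equal += (data[j] == v);
        }
        if (less <= k && k < less + equal) return v;   // global rank k lands here
    }
    return lo;                                          // not reached for finite input
}
```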

1 Answer


You could consider using the same heap allocation for all operations and avoiding freeing it until you're done. That way, rather than creating millions of arrays, you create just one.

Of course, this approach is more complex if you're doing these find-median operations in parallel; you'd need one array per thread.
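A minimal sketch of what that could look like, assuming a float array accessed with an offset and step as in the question; the class, its member names, and the use of std::nth_element are illustrative choices, not the asker's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: one reusable scratch buffer, shared by all median
// queries on a given thread. The source array is read-only; selection
// happens inside the copy.
class MedianScratch {
public:
    float median(const float* data, std::size_t offset,
                 std::size_t step, std::size_t count) {
        assert(count > 0 && step > 0);
        scratch_.clear();                      // keeps capacity: no free, no alloc
        scratch_.reserve(count);               // allocates only if the buffer must grow
        for (std::size_t i = 0; i < count; ++i)
            scratch_.push_back(data[offset + i * step]);

        auto mid = scratch_.begin() + count / 2;
        std::nth_element(scratch_.begin(), mid, scratch_.end());
        return *mid;                           // lower median for even counts
    }

private:
    std::vector<float> scratch_;               // reused across millions of calls
};
```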

Chris
  • In the future, the recursive divide steps might be parallelized, but each would just act on a different area of the array. An easy solution would be making a single copy of the array that I could work with throughout the whole algorithm, but that would hurt my cache hit/miss ratio, which is one of the big considerations throughout this library. – TheRobotCarlson Jun 23 '16 at 19:49