
I have an array of n integers (not necessarily distinct!) and I would like to iterate over all subsets of size k. However, I'd like to exclude all duplicate subsets.

e.g.

array = {1,2,2,3,3,3,3}, n = 7, k = 2

then the subsets I want to iterate over (each once) are:

{1,2},{1,3},{2,2},{2,3},{3,3}

What is an efficient algorithm for doing this? Is a recursive approach the most efficient/elegant?

In case you have a language-specific answer, I'm using C++.

Alex
  • Why can't you uniquify the original array first and then just use your standard solution to enumerate all subsets? – Kerrek SB May 28 '15 at 00:07
  • @KerrekSB That would drop `{2,2}` and `{3,3}`. – Barry May 28 '15 at 00:07
  • @KerrekSB wouldn't I miss out {2,2} and {3,3}? EDIT: oh you were quicker. Also downvoters. what is wrong with my question? – Alex May 28 '15 at 00:08
  • @Alex: I see, good point. – Kerrek SB May 28 '15 at 00:08
  • Hm, off-topic, belongs on [CS](http://cs.stackexchange.com/)? – Kerrek SB May 28 '15 at 00:14
  • @KerrekSB Sorry, I wasn't aware of that website. If no moderator moves this question, I'll delete it and repost it over there. – Alex May 28 '15 at 00:18
  • I don't think it's entirely off-topic, because it *could* have been covered partly or completely by a standard library algorithm. E.g. for a slightly different question `std::next_permutation` could have been part of an answer. The answer to this question involves counting in a restricted way (increasing digit sequences only), but I don't think the standard library helps with that. – Cheers and hth. - Alf May 28 '15 at 00:21
  • Maybe duplicate of this? http://stackoverflow.com/questions/127704/algorithm-to-return-all-combinations-of-k-elements-from-n – Barry May 28 '15 at 00:25

4 Answers


The same (or almost the same) algorithm which is used to generate combinations of a set of unique values in lexicographical order can be used to generate combinations of a multiset in lexicographical order. Doing it this way avoids the necessity to deduplicate, which is horribly expensive, and also avoids the necessity of maintaining all the generated combinations. It does require that the original list of values be sorted.

The following simple implementation finds the next k-combination of a multiset of n values in average (and worst-case) time O(n). It expects two ranges: the first range is a sorted k-combination, and the second range is the sorted multiset. (If either range is unsorted or the values in the first range do not constitute a sub(multi)set of the second range, then the behaviour is undefined; no sanity checks are made.)

Only the end iterator from the second range is actually used, but I thought that made the calling convention a bit odd.

#include <functional>   // std::less
#include <iterator>     // std::iterator_traits

template<typename BidiIter, typename CBidiIter,
         typename Compare = std::less<typename std::iterator_traits<BidiIter>::value_type>>
bool next_comb(BidiIter first, BidiIter last,
               CBidiIter /* first_value */, CBidiIter last_value,
               Compare comp=Compare()) {
  /* 1. Find the rightmost value which could be advanced, if any */
  auto p = last;
  while (p != first && !comp(*(p - 1), *--last_value)) --p;
  if (p == first) return false;
  /* 2. Find the smallest value which is greater than the selected value */
  for (--p; comp(*p, *(last_value - 1)); --last_value) { }
  /* 3. Overwrite the suffix of the subset with the lexicographically smallest
   *    sequence starting with the new value */
  while (p != last) *p++ = *last_value++;
  return true;
}

It should be clear that steps 1 and 2 combined make at most O(n) comparisons, because each of the n values is used in at most one comparison. Step 3 copies at most O(k) values, and we know that k ≤ n.

This could be improved to O(k) in the case where no values are repeated, by maintaining the current combination as a container of iterators into the value list rather than actual values. This would also avoid copying values, at the cost of extra dereferences. If in addition we cache the function which associates each value iterator with an iterator to the first instance of the next largest value, we could eliminate Step 2 and reduce the algorithm to O(k) even for repeated values. That might be worthwhile if there are a large number of repeats and comparisons are expensive.
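
The index-based variant described in the previous paragraph might look roughly like the following. This is a minimal sketch, not code from this answer: the names `build_next_greater` and `next_comb_indices` are made up for illustration, the values are assumed to be a sorted `std::vector<int>`, and the combination is held as a vector of indices into it rather than as iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each position i of the sorted vector, the index of
// the first element strictly greater than values[i] (values.size() if none).
std::vector<std::size_t> build_next_greater(const std::vector<int>& values) {
    assert(std::is_sorted(values.begin(), values.end()));
    const std::size_t n = values.size();
    std::vector<std::size_t> next(n, n);
    for (std::size_t i = n; i-- > 1; ) {
        // Either i itself holds a larger value, or we inherit i's answer.
        next[i - 1] = (values[i - 1] < values[i]) ? i : next[i];
    }
    return next;
}

// Hypothetical helper: advance idx (k indices into the sorted values) to the
// next distinct k-combination. Returns false once the last one was reached.
// Assumes idx started as 0, 1, ..., k-1 and was only changed by this function.
bool next_comb_indices(std::vector<std::size_t>& idx,
                       const std::vector<std::size_t>& next_greater) {
    const std::size_t n = next_greater.size();
    const std::size_t k = idx.size();
    assert(k <= n);
    // Scan from the right for a slot that can move to a larger value while
    // leaving room for the k - j - 1 slots after it.
    for (std::size_t j = k; j-- > 0; ) {
        const std::size_t m = next_greater[idx[j]];
        if (m + (k - j) <= n) {
            // Refill the suffix with consecutive indices starting at m.
            for (std::size_t t = j; t < k; ++t) idx[t] = m + (t - j);
            return true;
        }
    }
    return false;
}
```

Starting from `idx = {0, 1, ..., k-1}` and calling `next_comb_indices` in a `do`/`while` loop visits each distinct combination once; every call does O(k) work and makes no value comparisons, since those are all paid for once when the table is built.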

Here's a simple use example:

std::vector<int> values = {1,2,2,3,3,3,3};
/* Since that's sorted, the first subset is just the first k values */
const int k = 2;
std::vector<int> subset{values.cbegin(), values.cbegin() + k};

/* Print each combination */
do {
  for (auto const& v : subset) std::cout << v << ' ';
  std::cout << '\n';
} while (next_comb(subset.begin(),  subset.end(),
                   values.cbegin(), values.cend()));


rici
  • Thanks. I ended up caching the indices of the next largest integers, as you suggested. I like that this doesn't depend on a 'set'. – Alex May 29 '15 at 15:04

I like bit-twiddling for this problem. Sure, it limits you to only 32 elements in your vector, but it's still cool.

First, given a bit mask, determine the next bitmask permutation (source):

uint32_t next(uint32_t v) {
    uint32_t t = v | (v - 1);
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));  
}

Next, given a vector and a bitmask, give a new vector based on that mask:

std::vector<int> filter(const std::vector<int>& v, uint32_t mask) {
    std::vector<int> res;
    while (mask) {
        res.push_back(v[__builtin_ctz(mask)]);
        mask &= mask - 1;
    }   
    return res;
}

And with that, we just need a loop:

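// Requires arr to be sorted (so equal subsets compare equal) and
// 1 <= k <= arr.size() < 32.
std::set<std::vector<int>> get_subsets(const std::vector<int>& arr, uint32_t k) {
    std::set<std::vector<int>> s;
    // Shift an unsigned 32-bit one; shifting the int literal 1 can overflow.
    uint32_t max = uint32_t{1} << arr.size();
    for (uint32_t v = (uint32_t{1} << k) - 1; v < max; v = next(v)) {
        s.insert(filter(arr, v));
    }
    return s;
}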

int main()
{
    auto s = get_subsets({1, 2, 2, 3, 3, 3, 3}, 2);
    std::cout << s.size() << std::endl; // prints 5
}
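
One possible way to keep the bitmask iteration but drop the `std::set` is sketched below. It is an alternative under stated assumptions, not part of the code above: the array is sorted first, and a mask is accepted only when every selected element that equals its left neighbour also has that neighbour selected, so each distinct subset matches exactly one mask. The names `next_mask` and `distinct_subsets` are hypothetical, and `next_mask` uses the portable form of Gosper's hack instead of `__builtin_ctz`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Next larger integer with the same number of set bits (Gosper's hack).
// Requires v != 0.
std::uint32_t next_mask(std::uint32_t v) {
    const std::uint32_t c = v & (~v + 1);   // lowest set bit of v
    const std::uint32_t r = v + c;          // ripple the lowest run upwards
    return (((r ^ v) >> 2) / c) | r;
}

// Hypothetical helper: all distinct k-element subsets of arr, no set needed.
// Assumes 1 <= k <= arr.size() < 32.
std::vector<std::vector<int>> distinct_subsets(std::vector<int> arr, unsigned k) {
    assert(k >= 1 && k <= arr.size() && arr.size() < 32);
    std::sort(arr.begin(), arr.end());

    // Bit i of dup is set when arr[i] repeats arr[i - 1].
    std::uint32_t dup = 0;
    for (std::size_t i = 1; i < arr.size(); ++i)
        if (arr[i] == arr[i - 1]) dup |= std::uint32_t{1} << i;

    const std::uint32_t limit = std::uint32_t{1} << arr.size();
    std::vector<std::vector<int>> out;
    for (std::uint32_t v = (std::uint32_t{1} << k) - 1; v < limit; v = next_mask(v)) {
        // Reject masks that pick a repeated value but skip its left neighbour.
        if (v & dup & ~(v << 1)) continue;
        std::vector<int> pick;
        for (std::size_t i = 0; i < arr.size(); ++i)
            if ((v >> i) & 1u) pick.push_back(arr[i]);
        out.push_back(std::move(pick));
    }
    return out;
}
```

This still visits every one of the C(n, k) masks, so it only saves the cost of building and comparing vectors for the rejected ones; the subsets come out in mask order, not lexicographic order.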
Barry
  • Insert comment about using bitmask to iterate quickly, but then sticking in results in a `set` anyway here. – Barry May 28 '15 at 00:22
  • Pretty cool, seems to work! I don't think I need more than 32 elements anyway. I guess this requires that my array is sorted beforehand (in case it isn't)? – Alex May 28 '15 at 00:31
  • @Alex Er, I guess the de-duping logic is quite right. Anyway, the only good thing about this solution is that it's cool. You can definitely do way better. – Barry May 28 '15 at 00:31

Unlike the previous answer, this is not as efficient and doesn't do anything as fancy as a lot of the bit twiddling. However it does not limit the size of your array or the size of the subset.

This solution uses std::next_permutation to generate the combinations, and takes advantage of std::set's uniqueness property.

#include <algorithm>
#include <vector>
#include <set>
#include <iostream>
#include <iterator>

using namespace std;

std::set<std::vector<int>> getSubsets(const std::vector<int>& vect, size_t numToChoose)
{
    std::set<std::vector<int>> returnVal;
    // return the whole thing if we want to
    // choose everything 
    if (numToChoose >= vect.size())
    {
        returnVal.insert(vect);
        return returnVal;
    }

    // set up bool vector for combination processing
    std::vector<bool> bVect(vect.size() - numToChoose, false);

    // stick the true values at the end of the vector
    bVect.resize(bVect.size() + numToChoose, true); 

    // select where the ones are set in the bool vector and populate
    // the combination vector
    do
    {
        std::vector<int> combination;
        for (size_t i = 0; i < bVect.size() && combination.size() <= numToChoose; ++i)
        {
            if (bVect[i])
                combination.push_back(vect[i]);
        }
        // sort the combinations
        std::sort(combination.begin(), combination.end());

        // insert this new combination in the set
        returnVal.insert(combination);
    } while (next_permutation(bVect.begin(), bVect.end()));
    return returnVal;
}

int main()
{
    std::vector<int> myVect = {1,2,2,3,3,3,3};

    // number to select
    size_t numToSelect = 3;

    // get the subsets
    std::set<std::vector<int>> subSets = getSubsets(myVect, numToSelect);

    // output the results
    for_each(subSets.begin(), subSets.end(), [] (const vector<int>& v) 
    { cout << "subset "; copy(v.begin(), v.end(), ostream_iterator<int>(cout, " ")); cout << "\n"; });
}

Live example: http://coliru.stacked-crooked.com/a/beb800809d78db1a

Basically we set up a bool vector and populate a vector with the values that correspond with the position of the true items in the bool vector. Then we sort and insert this into a set. The std::next_permutation shuffles the true values in the bool array around and we just repeat.
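
To make the role of the bool vector concrete, here is a small sketch that records which positions are selected on each pass of the loop. The helper name `index_combinations` is hypothetical and not part of the code above; it only isolates the selector mechanics, without the sorting and the set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: every way of choosing k positions out of n, in the
// order std::next_permutation walks the selector vector.
std::vector<std::vector<std::size_t>> index_combinations(std::size_t n, std::size_t k) {
    assert(k <= n);
    // n - k false values followed by k true values: the lexicographically
    // smallest arrangement, which is where next_permutation has to start.
    std::vector<bool> selector(n - k, false);
    selector.resize(n, true);

    std::vector<std::vector<std::size_t>> result;
    do {
        std::vector<std::size_t> picked;
        for (std::size_t i = 0; i < n; ++i)
            if (selector[i]) picked.push_back(i);
        result.push_back(picked);
    } while (std::next_permutation(selector.begin(), selector.end()));
    return result;
}
```

For example, `index_combinations(4, 2)` yields six index sets, starting with {2, 3} and ending with {0, 1}. For the 7-element array in the question with k = 2 there are 21 index sets but only 5 distinct value sets, which is why the results are sorted and pushed through a `std::set`.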

Admittedly, not as sophisticated and more than likely slower than the previous answer, but it should do the job.

PaulMcKenzie

The basic idea of this solution is a function, here called ascend_ordered, that works like next_permutation but generates the next ascending sequence of "digits".

template< class It >
auto ascend_ordered( const int n_digits, const It begin, const It end )
    -> bool
{
    using R_it = reverse_iterator< It >;
    const R_it r_begin  = R_it( end );
    const R_it r_end    = R_it( begin );

    int max_digit = n_digits - 1;
    for( R_it it = r_begin ; it != r_end; ++it )
    {
        if( *it < max_digit )
        {
            ++*it;
            const int n_further_items = it - r_begin;
            for( It it2 = end - n_further_items; it2 != end; ++it2 )
            {
                *it2 = *(it2 - 1) + 1;
            }
            return true;
        }
        --max_digit;
    }
    return false;
}

Main program for the case at hand:

auto main() -> int
{
    vector<int> a = {1,2,2,3,3,3,3};
    assert( is_sorted( begin( a ), end( a ) ) );
    const int k = 2;
    const int n = a.size();
    vector<int> indices( k );
    iota( indices.begin(), indices.end(), 0 );      // Fill with 0, 1, 2 ...
    set<vector<int>> encountered;
    for( ;; )
    {
        vector<int> current;
        for( int const i : indices ) { current.push_back( a[i] ); }
        if( encountered.count( current ) == 0 )
        {
            cout << "Indices " << indices << " -> values " << current << endl;
            encountered.insert( current );
        }
        if( not ascend_ordered( n, begin( indices ), end( indices ) ) )
        {
            break;
        }
    }
}

Supporting includes and i/o:

#include <algorithm>
using std::is_sorted;

#include <assert.h>

#include <iterator>
using std::reverse_iterator;

#include <iostream>
using std::ostream; using std::cout; using std::endl;

#include <numeric>
using std::iota;

#include <set>
using std::set;

#include <utility>
using std::begin; using std::end;

#include <vector>
using std::vector;

template< class Container, class Enable_if = typename Container::value_type >
auto operator<<( ostream& stream, const Container& c )
    -> ostream&
{
    stream << "{";
    int n_items_outputted = 0;
    for( const int x : c )
    {
        if( n_items_outputted >= 1 ) { stream << ", "; }
        stream << x;
        ++n_items_outputted;
    }
    stream << "}";
    return stream;
}
Cheers and hth. - Alf
  • For {1,2,2,3,3,3,4} and k=3 it generates {1,2,3} twice. – Alex May 28 '15 at 08:02
  • Thanks! and sorry that was a bug. I erroneously had imagined that the sets would always be generated in ascending order, but that's only true for the indices... Fixed by keeping track of all sets encountered. – Cheers and hth. - Alf May 28 '15 at 08:38
  • Also fixed a problem with the headers: `std::iota` compiled with g++ by including only ``, but Visual C++ evidently follows the standard more exactly here, so it's necessary to include `<numeric>`. – Cheers and hth. - Alf May 28 '15 at 08:45