Bucketing in R or SQL

Question

I am completely stumped on a problem and would like some guidance. I am picking random sets of 8 numbers from the set of 1 to 8 (for example, 5,6,8,1,3,4,2,7) and trying to bucket those numbers as subsets of sequential numbers according to the order they appear.

For the example above, the first bucket would start with a 5 then the 6 would be added. Upon hitting the 8 a new bucket would be started. Whenever we get to a number that belongs in an existing bucket (e.g., when we reach 2, it can be added to 1's bucket), we add it there. In this example, after all 8 numbers we'd arrive at:

5,6,7
8
1,2
3,4

For a total of 4 buckets.

I am not actually concerned with the contents of the buckets, I just want to count how many buckets there are for a given random set of 8 digits. I plan on looping through a set of 1000 of these 8 digit sequences.

I don't understand the logic behind this. 7 didn't appear in the first sequence. Neither 2 in the third. — David Arenburg, Jul 16 '15 at 18:25
So the idea is that we are going through digit by digit, creating a new bucket if a number isn't in sequence when any previous bucket. So 5 creates the first, 6 goes in it. 8 Creates the second bucket. 1 creates the third. 3 creates the 4th. Then 4,2 and 7 are appended to already existing buckets (because they are in sequence). Odd problem, yes, but kind of an interesting one. — Turtle Time, Jul 16 '15 at 18:37

Rorschach · Answer 1 · 2015-07-16T19:33:30.683

If you are just interested in the number of buckets,

## Your data
dat <- c( 5,6,8,1,3,4,2,7)

## Get the number of buckets
count <- 0
for (i in seq_along(dat))
    if (!((dat[i] - 1) %in% dat[1:i])) count <- count+1
count
# 4

and, more succinctly in a function

countBuckets <- function(lst) sum(sapply(1:length(lst), function(i)
    (!((lst[i]-1) %in% lst[1:i]))))

And, here is a recursive implementation to get the contents of buckets.

f <- function(lst, acc=NULL) {
    if (length(lst) == 0) return(acc)
    if (missing(acc)) return( Recall(lst[-1], list(lst[1])) )

    diffs <- sapply(acc, function(x) lst[1] - x[length(x)] == 1)
    if (any(diffs)) {
        acc[[which(diffs)]] <- c(acc[[which(diffs)]], lst[1])
    } else { acc <- c(acc, lst[1]) }
    return ( Recall(lst[-1], acc) )
}

f(dat)

# [[1]]
# [1] 5 6 7
# 
# [[2]]
# [1] 8
# 
# [[3]]
# [1] 1 2
# 
# [[4]]
# [1] 3 4

This answer is not getting its well deserved appreciation – David Arenburg Jul 16 '15 at 19:59 — David Arenburg, Jul 16 '15 at 19:59

score 5 · Answer 2 · answered Jul 16 '15 at 19:32

My solution, not ripped of from nongkrong but quite similar. You get the count of buckets:

x <- as.integer(c(5,6,8,1,3,4,2,7))
sum(is.na(sapply(1:length(x), function(i) which((x[i]-1L)==x[1:i])[1L])))
# [1] 4

I believe it is possible to vectorize it, then it would scale perfectly.

MichaelChirico · Answer 3 · 2016-03-22T00:37:40.663

4

Inspired by @jangorecki but quicker:

x <- sample(8L)
1 + sum(sapply(2L:8L, function(i) !any(x[i] - x[1:(i - 1L)] == 1)))

edited Mar 22 '16 at 00:37

answered Jul 16 '15 at 19:48

MichaelChirico

31,197
13
98
169

MichaelChirico · Answer 4 · 2016-03-22T00:37:08.710

3

Here's a vectorized answer:

ind.mat <- matrix(rep(1:8, each=8), ncol=8)
ind.mat[upper.tri(ind.mat)] <- NA
8 - sum(rowSums(matrix(rep(x, 8), ncol=8) - x[ind.mat] == 1, na.rm=TRUE))

Note that we only need to declare ind.mat once, so scales up well to replication.

edited Mar 22 '16 at 00:37

answered Jul 16 '15 at 20:19

MichaelChirico

31,197
13
98
169

1

Also it should get a much bigger speed up if OP can provide all 1000 cases and build a single matrix on them. In the most pessimistic implementation the case id would form an additional dimension in the array. – jangorecki Jul 16 '15 at 20:45

Brian · Answer 5 · 2015-07-16T18:37:09.717

2

I'm not too familiar with R, but you can definitely do something like:

setOf8 = your array of 8 numbers
buckets=0
for( i = [2,8] )
{
    if( (setOf8[i] < setOf8[i-1]) )
    {
        buckets = buckets + 1
    }
}

EDIT:

You could do something like:

func countBuckets( buckets, set )
{
    set = your array
    current = 1
    for( i = [2,size(set)] )
    {
        if( set[current] + 1 == set[i] )
        {
            set.remove( current )
            current = set[i-1]
        }
    }
    if( size(set) == 0 )
    {
        return buckets
    }
return countBuckets( buckets + 1, set )
}

edited Jul 16 '15 at 18:37

answered Jul 16 '15 at 18:08

Brian

1,598
8
17

Thanks. I've started with something like that, but the problem happens with numbers like that last 7, it's not in sequence, but it doesn't start a new bucket. It needs to get appended to the first bucket. – Turtle Time Jul 16 '15 at 18:19

score 2 · Answer 6 · answered Jul 16 '15 at 19:32

I'm not sure how it will fare on Oracle, but since you have added the SQL Server tag, here is a T-SQL solution:

declare @set char(8) = '56813427';

with cte as (
    select s.Id, cast(substring(@set, s.Id, 1) as int) as [Item]
    from dbo.Sequencer s
    where s.Id between 1 and 8
    union all
    select 9 as [Id], 0 as [Item]
)
select count(*) as [TotalBuckets]
from cte s
    inner join cte n on (s.Item = n.Item - 1) and s.Id > n.Id;

The idea behind it is to count the cases when next number goes before the current one, beginning a new bucket rather than continuing the current one. The only problem here is with boundaries, so I added trailing zero. Without it, least set item (1 in your case) is not counted as a separate bucket.

P.S. dbo.Sequencer is a table with incrementing integers. I usually keep one in the database to project ordered sequences.

Bucketing in R or SQL

6 Answers6