I am writing a simulation that creates 10,000 periods of 25 sets, with each set consisting of 48 coin tosses. Something in this code is making it run very slowly: it has been running for at least 20 minutes and is still going. A similar simulation in R runs in under 10 seconds.

Here is the Python code I am using:

import pandas as pd
from random import choices

threshold=17
all_periods = pd.DataFrame()

for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        #Data frame with 48 weeks as rows. Each run through loop adds one more year as column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
        positives = simulated_period[simulated_period==1].count(axis=1)
        negatives = simulated_period[simulated_period==-1].count(axis=1)
        #Combine positives and negatives that are more than the threshold into single dataframe
        sig = pd.DataFrame([[sum(positives>=threshold), sum(negatives>=threshold)]], columns=['positive', 'negative'])
        sig['total'] = sig['positive'] + sig['negative']
    #Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])

If it helps, here is the R script that is running quickly:

flip <- function(threshold=17){
  #threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative

  outcomes <- c(1, -1)
  trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
  trial <- as.data.frame(t(trial)) #48 weeks in columns, 25 years in rows.

  summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
  summary <- as.data.frame(t(summary)) #use data frame so $pos/$neg can be used instead of [1,]/[2,]

  sig.pos <- length(summary$pos[summary$pos>=threshold])
  sig.neg <- length(summary$neg[summary$neg>=threshold])

  significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg) 

  return(significant)
}

results <- do.call(rbind, lapply(1:10000, function(i) flip(threshold)))
results <- as.data.frame(results)

Can anyone tell me what in my Python code is slowing the process down? Thank you.

mks212
  • Are the lines between `positives = ` and `sig['total'] = ` really meant to be within the `for j in range(25)` loop? – ASGM Jul 16 '19 at 17:08
  • The major slow down is almost certainly the concats within a loop: `simulated_period = pd.concat([simulated_period..` This makes unnecessary copies and is O(N^2). Typically you'd append to a list within the loop and concat once at the end. – ALollz Jul 16 '19 at 17:40
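The list-accumulate-then-concat pattern ALollz describes can be sketched as follows. This is an assumed rewrite of the question's loop, not code from the thread, and the period count is cut to 100 so the sketch runs quickly:

```python
import pandas as pd
from random import choices

threshold = 17
n_periods = 100  # the question uses 10,000; reduced here to keep the sketch quick

rows = []
for i in range(n_periods):
    # Build all 25 year-columns in a plain list, then concatenate once per period
    cols = [pd.DataFrame(choices([1, -1], k=48)) for j in range(25)]
    simulated_period = pd.concat(cols, ignore_index=True, axis=1)
    # Count +1s and -1s across the 25 years for each of the 48 weeks
    positives = simulated_period[simulated_period == 1].count(axis=1)
    negatives = simulated_period[simulated_period == -1].count(axis=1)
    rows.append([sum(positives >= threshold), sum(negatives >= threshold)])

# One DataFrame construction at the end instead of one concat per iteration
all_periods = pd.DataFrame(rows, columns=['positive', 'negative'])
all_periods['total'] = all_periods['positive'] + all_periods['negative']
```

Appending to a Python list is amortized O(1), so the quadratic copying from repeated `pd.concat` calls disappears.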

1 Answer

Why don't you generate the whole big set at once:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                 names=('period', 'set'))
df = pd.DataFrame(data=np.random.choice([1, -1], (10000*25, 48)), index=idx)

That took about 120 ms on my computer. Then the other operations:

positives = df.eq(1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='negatives')

all_periods = pd.concat((positives, negatives), axis=1)

all_periods['total'] = all_periods.sum(axis=1)

take about 600 ms extra.
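The same vectorized approach can be sanity-checked on a toy case. The sizes below are illustrative, not from the answer: with 3 sets of 5 tosses per period and a threshold of 2, the +1 and -1 counts at each toss position sum to 3, so exactly one sign clears the threshold per position and every period's total must equal 5:

```python
import numpy as np
import pandas as pd

n_periods, n_sets, n_tosses, threshold = 4, 3, 5, 2
idx = pd.MultiIndex.from_product((range(n_periods), range(n_sets)),
                                 names=('period', 'set'))
df = pd.DataFrame(np.random.choice([1, -1], (n_periods * n_sets, n_tosses)),
                  index=idx)

# Per-period count of +1 (resp. -1) at each toss position, then count the
# positions meeting the threshold (>=, matching the question's comparison)
positives = df.eq(1).groupby(level=0).sum().ge(threshold).sum(axis=1)
negatives = df.eq(-1).groupby(level=0).sum().ge(threshold).sum(axis=1)

all_periods = pd.concat({'positive': positives, 'negative': negatives}, axis=1)
all_periods['total'] = all_periods.sum(axis=1)
```

Every invariant here (one significant sign per position, total equal to the toss count) holds regardless of the random draw, which makes it a useful structural check.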

Quang Hoang
  • Thank you Quang, this is much better than my solution. I believe there is one error in your code, the last line should read: `all_periods['total'] = all_periods.sum(1)`. Replace `new_df` with `all_periods`. Thank you again. – mks212 Jul 16 '19 at 18:42