
I have quite a large dataframe (3 million rows) that looks like this:

import pandas as pd

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00']})
df

I have a column for the user id (user_id). The users provide services, each service having an id (service_id) and a date (date_of_service).

I would like to create a column that contains, for each row, the number of services the user has provided in the past (including the present one), i.e. I would like the following result dataframe:

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00'],
              'number_of_past_services' : [2, 1, 1, 1, 2]})
df

What I have done

I used a groupby and a count:

df['count_services'] = df.groupby('user_id')['service_id'].transform('count')

The problem is that, here, I count all occurrences over the whole dataset. What I want is to have past occurrences!

I have tried to filter using SQL-like operations, for example:

len(df[(df.date_of_service < df.date_of_service[0]) & (df.user_id == df.user_id[0])])

which gives the correct result for the first row. However, this computation takes about 1 second for a single row!

I would like to know how to scale this so that I can create a column efficiently.

2 Answers


IIUC you can do it this way:

In [69]: df['number_of_past_services'] = df.sort_values('date_of_service') \
                                           .assign(x=1) \
                                           .groupby('service_id')['x'].cumsum()

In [70]: df
Out[70]:
      date_of_service service_id user_id  number_of_past_services
0 2015-06-10 17:00:00         73     100                        2
1 2014-09-27 17:00:00         73     101                        1
2 2015-01-12 17:00:00         46     102                        1
3 2012-08-22 17:00:00         12     103                        1
4 2013-03-07 17:00:00         12     104                        2
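
For reference, an equivalent formulation (not from the answer, but the same logic under the same sorted order) avoids the helper column x by using cumcount(). Note that the expected output in the question corresponds to grouping by service_id, which is what this answer does:

df['number_of_past_services'] = (df.sort_values('date_of_service')
                                   .groupby('service_id')
                                   .cumcount() + 1)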
MaxU
  • @PeterMartigny, you are welcome. Please consider [accepting](http://meta.stackexchange.com/a/5235) an answer if you think it has answered your question – MaxU Feb 05 '17 at 16:44

If I understood it correctly, you could (see the sketch below):

  • sort your data by date_of_service
  • create an empty dictionary of past occurrences
  • iterate over all the sorted rows:
    • past_occurrences.get(user_id, 0) holds the information you need
    • past_occurrences[user_id] = past_occurrences.get(user_id, 0) + 1 will update the dictionary.

Of all steps, the sort will be the slowest. The rest should be reasonably fast.
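
A minimal sketch of that loop, assuming df is defined as in the question (the variable names are illustrative):

# Sort chronologically so each row only "sees" earlier services.
df_sorted = df.sort_values('date_of_service')

past_occurrences = {}  # user_id -> number of services seen so far
counts = []
for user_id in df_sorted['user_id']:
    # Count the current service too, as the question requires.
    past_occurrences[user_id] = past_occurrences.get(user_id, 0) + 1
    counts.append(past_occurrences[user_id])

df_sorted['number_of_past_services'] = counts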

PS: You could also use a defaultdict.
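
For instance, a sketch of the same loop with a defaultdict (missing keys default to 0, so the .get call disappears):

from collections import defaultdict

past_occurrences = defaultdict(int)  # missing keys start at 0
counts = []
for user_id in df_sorted['user_id']:
    past_occurrences[user_id] += 1
    counts.append(past_occurrences[user_id])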

Eric Duminil