
I have quite a large dataframe (3 million rows) that looks like this:

import pandas as pd

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00']})
df

I have a column for the user id (user_id). The users provide services, each service having an id (service_id) and a date (date_of_service).

I would like to create a column that contains, for each row, the number of services the user has provided in the past (including the present one), i.e. I would like the following result dataframe:

df = pd.DataFrame({'user_id' : ['100','101','102','103','104'],
           'service_id' : ['73', '73', '46', '12', '12'],
           'date_of_service' : ['2015-06-10 17:00:00', 
                               '2014-09-27 17:00:00', 
                               '2015-01-12 17:00:00', 
                               '2012-08-22 17:00:00', 
                               '2013-03-07 17:00:00'],
              'number_of_past_services' : [2, 1, 1, 1, 2]})
df

What I have done

I used a groupby and a count:

df['count_services'] = df.groupby('user_id')['service_id'].transform('count')

The problem is that, here, I count all occurrences over the whole dataset. What I want is to have past occurrences!

I have tried to filter using SQL-like operations, for example:

len(df[(df.date_of_service < df.date_of_service[0]) & (df.user_id == df.user_id[0])])

which gives the correct result for the first row. However, this computation takes about 1 second for a single row!

I would like to know how to scale this so that I can create a column efficiently.

2 Answers


IIUC you can do it this way:

In [69]: df['number_of_past_services'] = df.sort_values('date_of_service') \
                                           .assign(x=1) \
                                           .groupby('service_id')['x'].cumsum()

In [70]: df
Out[70]:
      date_of_service service_id user_id  number_of_past_services
0 2015-06-10 17:00:00         73     100                        2
1 2014-09-27 17:00:00         73     101                        1
2 2015-01-12 17:00:00         46     102                        1
3 2012-08-22 17:00:00         12     103                        1
4 2013-03-07 17:00:00         12     104                        2
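
For reference, an equivalent formulation (not from the answer, but the same logic under the same sorted order) avoids the helper column x by using cumcount(). Note that the expected output in the question corresponds to grouping by service_id, which is what this answer does:

df['number_of_past_services'] = (df.sort_values('date_of_service')
                                   .groupby('service_id')
                                   .cumcount() + 1)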
MaxU
  • @PeterMartigny, you are welcome. Please consider [accepting](http://meta.stackexchange.com/a/5235) an answer if you think it has answered your question – MaxU Feb 05 '17 at 16:44

If I understood it correctly, you could (see the sketch below):

  • sort your data by date_of_service
  • create an empty dictionary of past occurrences
  • iterate over all the sorted rows:
    • past_occurrences.get(user_id, 0) holds the information you need
    • past_occurrences[user_id] = past_occurrences.get(user_id, 0) + 1 will update the dictionary.

Of all steps, the sort will be the slowest. The rest should be reasonably fast.
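
A minimal sketch of that loop, assuming df is defined as in the question (the variable names are illustrative):

# Sort chronologically so each row only "sees" earlier services.
df_sorted = df.sort_values('date_of_service')

past_occurrences = {}  # user_id -> number of services seen so far
counts = []
for user_id in df_sorted['user_id']:
    # Count the current service too, as the question requires.
    past_occurrences[user_id] = past_occurrences.get(user_id, 0) + 1
    counts.append(past_occurrences[user_id])

df_sorted['number_of_past_services'] = counts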

PS: You could also use a defaultdict.
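
For instance, a sketch of the same loop with a defaultdict (missing keys default to 0, so the .get call disappears):

from collections import defaultdict

past_occurrences = defaultdict(int)  # missing keys start at 0
counts = []
for user_id in df_sorted['user_id']:
    past_occurrences[user_id] += 1
    counts.append(past_occurrences[user_id])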

Eric Duminil