
Suppose I have three Series:

import numpy as np
import pandas as pd

s1 = pd.Series(data=np.arange(5))
s2 = pd.Series(data=np.arange(5), index=np.arange(2, 7))
s3 = pd.Series(data=np.arange(5), index=np.arange(5, 10))

What is the most efficient way to merge them into a single Series that sums the values at duplicated index labels and keeps the values at non-duplicated labels? In other words, for s1, s2, s3 we would expect the output:

0    0
1    1
2    2
3    4
4    6
5    3
6    5
7    2
8    3
9    4

In reality we could have a large number of Series, each with a large number of entries, so looping over them is not efficient.
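
To make the looping being avoided concrete, the pairwise approach would look something like the sketch below (this snippet is illustrative and not part of the original question; `series_list` is a hypothetical name):

# naive pairwise loop: every iteration re-aligns the indexes and copies the
# accumulated result, so total work grows quadratically with the number of Series
series_list = [s1, s2, s3]          # in practice, potentially hundreds of Series
out = series_list[0]
for s in series_list[1:]:
    out = out.add(s, fill_value=0)  # union of indexes; missing labels treated as 0
                                    # (alignment may upcast the result to float64)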

mch56

2 Answers


Let's use concat and sum here:

pd.concat([s1, s2, s3]).sum(level=0)

0    0
1    1
2    2
3    4
4    6
5    3
6    5
7    2
8    3
9    4
dtype: int64

Concatenation is an O(n) operation (linear complexity), recommended 99% of the time.
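
Note that the `level` argument to `sum` was later deprecated and removed in newer pandas releases; on those versions the same idea is spelled with an explicit groupby on the index (a small adaptation, not from the original answer):

# equivalent on pandas versions without sum(level=...)
pd.concat([s1, s2, s3]).groupby(level=0).sum()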


If you like method chaining, this one's for you.

s1.append(s2).append(s3).sum(level=0)

0    0
1    1
2    2
3    4
4    6
5    3
6    5
7    2
8    3
9    4
dtype: int64

Not recommended for more than 3 Series: appending two at a time turns out to be quadratic in complexity, since every append copies everything accumulated so far.
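
On pandas versions that still provide `Series.append` (it was later deprecated and removed in favour of `pd.concat`), the quadratic cost can also be avoided by passing all remaining Series as a single list, which is essentially one concat under the hood. A rough sketch, not from the original answer:

# one append call with a list of Series (older pandas only); stays linear
s1.append([s2, s3]).sum(level=0)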

cs95
  • Nice - I guess `concat` is dealing with the outer join under the hood. – mch56 Apr 16 '19 at 18:55
  • @ojlm This is technically not a join operation, as you are only concatenating vertically; the `join` argument would make a difference when merging them horizontally. Here's an exhaustive discussion on the topic: https://stackoverflow.com/questions/53645882/pandas-merging-101 – cs95 Apr 16 '19 at 18:58

Here's one with NumPy tools -

def sum_series(L): # L is a list of Series
    aL = [list(l.index) for l in L]                # gather every index label
    intL,unqL = pd.factorize(np.concatenate(aL))   # integer ids + unique labels
    sums = np.bincount(intL,np.concatenate(L))     # sum values sharing an id
    return pd.Series(sums, index=unqL)

Sample run -

In [158]: L = [s1,s2,s3] # list of series

In [159]: sum_series(L)
Out[159]: 
0    0.0
1    1.0
2    2.0
3    4.0
4    6.0
5    3.0
6    5.0
7    2.0
8    3.0
9    4.0
dtype: float64

With generic labels -

In [170]: L
Out[170]: 
[Label0    0
 Label1    1
 Label2    2
 Label3    3
 Label4    4
 dtype: int64, Label2    0
 Label3    1
 Label4    2
 Label5    3
 Label6    4
 dtype: int64, Label5    0
 Label6    1
 Label7    2
 Label8    3
 Label9    4
 dtype: int64]
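
(The construction of this list isn't shown in the answer; one way to reproduce comparable inputs, using a hypothetical helper name, would be:)

def labelled(start):  # hypothetical helper, for illustration only
    return pd.Series(np.arange(5), index=['Label%d' % i for i in range(start, start + 5)])

L = [labelled(0), labelled(2), labelled(5)]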

In [171]: sum_series(L)
Out[171]: 
Label0    0.0
Label1    1.0
Label2    2.0
Label3    4.0
Label4    6.0
Label5    3.0
Label6    5.0
Label7    2.0
Label8    3.0
Label9    4.0
dtype: float64
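
A side note on the float output above: `np.bincount` returns a float64 array whenever weights are passed, even when all the inputs are integers, which is what motivates Version #2. A minimal illustration:

np.bincount([0, 1, 1], weights=[1, 2, 3])  # -> array([1., 5.]), float64 despite integer weights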

Version #2

By concatenating the underlying array data and choosing the output dtype from the inputs, a more desirable output (one that keeps the integer dtype) can probably be obtained with something like this -

def sum_series_v2(L): # L is a list of Series
    aL = [list(l.index) for l in L]                # gather every index label
    v = [l.values for l in L]                      # underlying NumPy arrays
    intL,unqL = pd.factorize(np.concatenate(aL))   # integer ids + unique labels
    sums = np.bincount(intL,np.concatenate(v))     # sum values sharing an id
    dtype = np.result_type(*[l.dtype for l in L])  # common dtype of the inputs
    return pd.Series(sums.astype(dtype), index=unqL)

Sample run -

In [225]: sum_series_v2(L)
Out[225]: 
Label0    0
Label1    1
Label2    2
Label3    4
Label4    6
Label5    3
Label6    5
Label7    2
Label8    3
Label9    4
dtype: int64
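
The key piece in Version #2 is `np.result_type`, which computes the common NumPy type of all the input dtypes, so all-integer inputs keep an integer result while mixed inputs promote as usual. For example (illustrative only):

np.result_type(np.int64, np.int64)    # dtype('int64'), integer output preserved
np.result_type(np.int64, np.float32)  # dtype('float64'), promotes when floats are involved
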
Divakar