23

I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds (concatenating in smaller batches with for loops, etc.) but I still cannot get these to concatenate. Two questions spring to mind:

  1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().

  2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?

I should also note that I do not think the problem is an explosion of columns as concatenating 100 of the dataframes together gives about 3000 columns whereas the base dataframe has about 1000.

Edit:

The data I am working with is financial data, about 1000 columns wide and about 50,000 rows deep, for each of my 900 dataframes. The types of data, going across left to right, are:

  1. date in string format,
  2. string
  3. np.float
  4. int

... and so on, repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 will not be discarded but shunted off to the side.
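To make the 'column merging' concrete, here is a toy sketch (made-up frames, not my real data) showing how join='outer' keeps the union of the columns:

    import pandas as pd

    # two tiny frames whose columns only partially overlap (hypothetical data)
    df1 = pd.DataFrame({'date': ['2014-08-06'], 'price': [1117.1], 'size': [103]})
    df2 = pd.DataFrame({'date': ['2013-12-04'], 'price': [1639.6], 'side': ['BEST_BID']})

    # join='outer' keeps every column from both frames; missing cells become NaN
    print(pd.concat([df1, df2], join='outer', axis=0, ignore_index=True))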


Example:

 #example code
 data=pd.concat(datalist4, join="outer", axis=0, ignore_index=True)
 #two example dataframes (about 90% of the column names should be in common
 #between the two dataframes, the unnamed columns, etc are not a significant
 #number of the columns)

print datalist4[0].head()
                800_1     800_2   800_3  800_4               900_1     900_2
0 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID
1 2014-08-06 09:00:00  BEST_ASK  1120.0    103 2014-08-06 09:00:00  BEST_ASK   
2 2014-08-06 09:00:00  BEST_BID  1106.9     11 2014-08-06 09:00:00  BEST_BID   
3 2014-08-06 09:00:00  BEST_ASK  1125.8     62 2014-08-06 09:00:00  BEST_ASK   
4 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID   

    900_3  900_4              1000_1    1000_2    ...     2400_4
0  1017.2    103 2014-08-06 09:00:00  BEST_BID    ...        NaN
1  1020.1    103 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
2  1004.3     11 2014-08-06 09:00:00  BEST_BID    ...        NaN   
3  1022.9     11 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
4  1006.7     10 2014-08-06 09:00:00  BEST_BID    ...        NaN   

                      _1  _2  _3  _4                   _1.1 _2.1 _3.1  _4.1
0  #N/A Invalid Security NaN NaN NaN  #N/A Invalid Security  NaN  NaN   NaN
1                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
2                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
3                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
4                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   

      dater  
0  2014.8.6  
1  2014.8.6  
2  2014.8.6  
3  2014.8.6  
4  2014.8.6  

[5 rows x 777 columns]

print datalist4[1].head()
                150_1     150_2   150_3  150_4               200_1     200_2
0 2013-12-04 09:00:00  BEST_BID  1639.6     30 2013-12-04 09:00:00  BEST_ASK
1 2013-12-04 09:00:00  BEST_ASK  1641.8    133 2013-12-04 09:00:08  BEST_BID   
2 2013-12-04 09:00:01  BEST_BID  1639.5     30 2013-12-04 09:00:08  BEST_ASK   
3 2013-12-04 09:00:05  BEST_BID  1639.4     30 2013-12-04 09:00:08  BEST_ASK   
4 2013-12-04 09:00:08  BEST_BID  1639.3    133 2013-12-04 09:00:08  BEST_BID   

    200_3  200_4               250_1     250_2    ...                 2500_1
0  1591.9    133 2013-12-04 09:00:00  BEST_BID    ...    2013-12-04 10:29:41
1  1589.4     30 2013-12-04 09:00:00  BEST_ASK    ...    2013-12-04 11:59:22   
2  1591.6    103 2013-12-04 09:00:01  BEST_BID    ...    2013-12-04 11:59:23   
3  1591.6    133 2013-12-04 09:00:04  BEST_BID    ...    2013-12-04 11:59:26   
4  1589.4    133 2013-12-04 09:00:07  BEST_BID    ...    2013-12-04 11:59:29   

     2500_2 2500_3 2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK   0.35     50  #N/A Invalid Security             NaN
1  BEST_ASK   0.35     11                    NaN             NaN   
2  BEST_ASK   0.40     11                    NaN             NaN   
3  BEST_ASK   0.45     11                    NaN             NaN   
4  BEST_ASK   0.50     21                    NaN             NaN   

  Unnamed: 844_3 Unnamed: 844_4         Unnamed: 848_1      dater  
0            NaN            NaN  #N/A Invalid Security  2013.12.4  
1            NaN            NaN                    NaN  2013.12.4  
2            NaN            NaN                    NaN  2013.12.4  
3            NaN            NaN                    NaN  2013.12.4  
4            NaN            NaN                    NaN  2013.12.4  

[5 rows x 850 columns]
sfortney
Can you give more information in terms of: number of rows and columns per dataframe, what the types are per field, and what you're joining on? – gabe Apr 22 '15 at 19:22
  • Yes I will edit my question to answer those. – sfortney Apr 22 '15 at 19:24
    @sfortney it will help if you could add a small, complete, runnable example showing how your code works. Then we will all be on the same page. You can hand-code or randomly generate say three or four small DataFrames and show exactly how you are concatenating them in the form of code. See this recent question for a similar example: http://stackoverflow.com/q/29629821/553404 – YXD Apr 22 '15 at 19:40
  • I have edited it with code examples. I will also link to a question that has some example output. – sfortney Apr 22 '15 at 19:50
  • Here is a link to another question (which I actually asked some time ago) that does exactly what I am talking about. http://stackoverflow.com/questions/28842681/concatenating-multiple-dataframes-with-non-standard-columns-python-pandas-solve – sfortney Apr 22 '15 at 19:54
    In general, sequentially concatenating subarrays is a slow way to produce a single large array. Concatenating `a` and `b` involves allocating a new output array the size of both `a` and `b` combined. As your output array gets longer and longer, the cost of allocating each new array gets larger and larger. It's much better to pre-allocate a single array the size of your whole output array, then fill in the rows as you go along. – ali_m Apr 22 '15 at 20:01
    Yep - see the similar case of StringBuilder classes http://www.yoda.arachsys.com/csharp/stringbuilder.html – YXD Apr 22 '15 at 20:13
    Another pandas-specific point - you will definitely want to pass `copy=False` to `pd.concat()` to avoid generating unnecessary copies – ali_m Apr 22 '15 at 20:25
  • @ali_m That actually sounds like the ideal way to approach this problem. Do you have any sample code that does that? I know how I would put the data into the numpy array but I don't know how I would make the columns match up when adding in. – sfortney Apr 23 '15 at 16:35
  • @sfortney Hmm, I've never before had to deal with DataFrames large enough for this to become a problem. One thing I would say is that `pd.concat` might not be that bad as long as you do it for all your subarrays in a single operation. `np.concatenate` iterates over its inputs, works out how big the output array needs to be, then preallocates it, so in principle a single call to `pd.concat` ought to be able to take care of the preallocation without you having to do it manually. What's *really* bad is looping over your subarrays and concatenating them sequentially to build your array. – ali_m Apr 23 '15 at 22:18
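
For illustration, here is a minimal sketch of the pre-allocation idea ali_m describes above. The column-union loop, the object-dtype buffer, and the assumption that column labels are unique within each frame are simplifications for this sketch, not something taken from the discussion itself:

    import numpy as np
    import pandas as pd

    def preallocated_concat(frames):
        # union of column labels in first-seen order (assumes labels are
        # unique within each frame)
        all_cols = []
        for df in frames:
            for c in df.columns:
                if c not in all_cols:
                    all_cols.append(c)
        col_pos = {c: i for i, c in enumerate(all_cols)}

        # allocate the full output once instead of growing it repeatedly
        total_rows = sum(len(df) for df in frames)
        out = np.full((total_rows, len(all_cols)), np.nan, dtype=object)

        start = 0
        for df in frames:
            stop = start + len(df)
            for c in df.columns:
                out[start:stop, col_pos[c]] = df[c].values
            start = stop

        return pd.DataFrame(out, columns=all_cols)

Note that the object-dtype buffer throws away per-column float/int storage, so a single pd.concat call over the whole list (which preallocates internally) may still be the better option in practice.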

2 Answers

16

I've had performance issues when concatenating a large number of DataFrames to a 'growing' DataFrame. My workaround was to append all the sub-DataFrames to a list and then concatenate that list in a single call once processing of the sub-DataFrames was complete.
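
A minimal sketch of that pattern, where load_frame is a hypothetical stand-in for whatever produces each sub-DataFrame:

    import numpy as np
    import pandas as pd

    # hypothetical producer of one sub-DataFrame; the columns overlap
    # but are not identical, as in the question
    def load_frame(i):
        return pd.DataFrame(np.random.randn(5, 3),
                            columns=['a', 'b', 'c' if i % 2 else 'd'])

    # collect the pieces in a list, then concatenate once at the end,
    # rather than appending to a growing DataFrame inside the loop
    frames = [load_frame(i) for i in range(10)]
    result = pd.concat(frames, join='outer', axis=0, ignore_index=True)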

Alexander
    This actually is my current workaround. It seems to work okay but I was wondering if there is an even better way. Thanks! – sfortney Apr 23 '15 at 16:36
    This cut my run time in half. Thanks! – meepl Jan 11 '17 at 19:02
  • This does not work for me at all. When you have a list of data frames and you want to concatenate rows (add rows to the dataframe with the same cols), do you have to do pd.concat(my_list, axis=1)? – Konrad Oct 20 '17 at 14:09
  • @Konrad `axis=0` to stack them on top of each other. – Alexander Oct 20 '17 at 14:51
  • Me too! Appending to a growing DataFrame becomes progressively slower. But when the types are misaligned, e.g. concatenating int64 and float64, it grinds to a halt (about 100x slower). The workaround, as @Alexander suggests, was to concat a list of frames. But I never managed to figure out why the append was slow. – user443854 May 14 '19 at 22:56
  • This idea doesn't work for me. I have almost 6 million rows, which I split into 1,000-row chunks, according to this: http://deo.im/2016/09/22/Load-data-from-mongodb-to-Pandas-DataFrame/. It runs fine, but when I get to the concat step it simply locks up my computer. Any ideas? – Eli Borodach Jun 10 '19 at 08:25
  • @EliBorodach If your data is truly too big for your computer, consider using Dask (https://dask.org/). – Alexander Jun 10 '19 at 15:25
2

Looks like you are trying to concat row-wise, even though your text indicates that you want column-wise. Specify axis=1.

Other points to consider:

The copy=False flag will not help at all; it only matters if you are not concatenating blocks of the same dtype (which you indicated you are).
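
For completeness, passing the flag is just an extra keyword argument (reusing the datalist4 name from the question):

    # copy=False asks concat to avoid copies where it can; whether it helps
    # depends on the dtypes involved, as noted above
    data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True, copy=False)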

pd.concat does use np.concatenate under the hood. If you think you can do better, then go for it.

import numpy as np
import pandas as pd

def make_frames(n=100, rows=100, cols=100):
    # column labels drawn from a pool of 110, so the frames overlap heavily but differ
    return [pd.DataFrame(np.random.randn(rows, cols), columns=np.random.choice(110, cols, replace=False)) for i in xrange(n)]

In [28]: l = make_frames(rows=10000)

In [29]: l[0].head()
Out[29]: 
        60        75        101       103       87        29        10        106       71        26        30        83        2         28        99        85        88        62        58        18        42        1         105       25        34     ...          102       27        22   \
0 -0.854117 -0.007549 -0.510359 -0.993757  0.877635 -0.303199 -1.488548  1.179360  0.578095  0.807792  0.169930 -1.781403  0.204696 -0.515057 -0.954246  1.106073  0.666516 -1.146988  1.335709  0.362838 -0.675379  1.483469  0.670385 -0.483312 -0.703795    ...     1.322645 -1.942183  1.053502   
1  2.057542  0.860946 -0.037665 -0.347265  0.152562 -0.859537  1.431045  1.306419  0.623013  1.192325  0.909597  1.710507  1.319330 -0.402874  1.749581  1.223489  0.036354  0.140255  0.844330 -0.091447 -0.347245  0.259055  1.187882 -0.216858 -1.421336    ...     1.122068  0.887538  0.205854   
2 -0.077974  0.947503  0.688666  0.288104 -1.275329 -0.840847 -2.014090 -1.318507 -0.889416 -0.098005  0.055492  0.847597 -1.289428 -0.910093  0.201312 -1.699879  0.103062 -1.041608  0.379171 -1.089937  0.894626 -1.500215 -0.501182  0.042078 -0.840789    ...     0.539192  0.193256  0.196138   
3  0.291993  1.138577  1.061509  0.856553  1.118931  0.725806 -0.689776  1.337957 -1.009835 -0.976506 -0.392317  0.295876  0.092240  0.418201  0.473585  0.013809 -1.169947  0.424797  0.019051 -0.526189  0.066991 -0.268750  1.277004 -0.736560 -0.314987    ...     0.272045 -0.333272  0.573267   
4 -2.073985 -0.016950 -1.712770  0.286212 -0.159693 -0.495864  1.286450 -1.168880  1.031456 -3.080568  1.443880 -0.604405  0.406383 -0.162986  1.077255  1.160726  0.943949 -1.517681 -1.049972  1.208850 -0.859617 -0.145358 -0.638898  0.248012 -2.985845    ...    -0.699697  0.051352  0.575304   

        69        76        91        45        14        37        0         81        38        72        107       11        5         73        70        8         90        94        53        3         55        12   
0 -0.972965 -0.298674  1.283482  2.344092 -0.597735 -0.407978  0.971726 -0.935620  0.236889 -0.957096 -2.366399 -0.943760  0.293325 -0.240385 -0.392554 -0.887556  0.261402 -2.050122 -1.776865 -1.513899 -0.953916  0.630495  
1 -1.471033  0.269830 -0.744507 -0.982779  0.624527 -1.782704  1.197262 -0.297730  1.122939 -1.039226  0.171351 -0.828985  0.698245  0.563430  0.718177  0.682369  1.415918  0.049931  0.648000  1.785455 -0.190021 -1.329753  
2 -1.942792  0.560981 -0.353782 -1.637407 -1.495131 -0.593041 -1.617116 -0.910257 -0.506877  0.178378 -0.623986  0.302544  0.279309 -0.266409  0.780306  0.986510 -1.549847  0.063632 -0.480434  1.393221 -1.237682  1.577320  
3  0.468151 -1.002872 -0.147329 -0.420609  0.183696  0.527632  0.018911 -2.059989  1.642613 -0.428345  1.350693 -1.323321 -0.247263  0.331525 -2.036862 -2.593575  0.362101 -0.184095  0.419231 -0.633878  0.097499 -0.026044  
4 -0.581330 -0.848421 -0.682027 -1.260004 -0.357354 -0.304743  0.409537 -1.189925 -0.609352 -0.610345 -0.798009  0.219822 -0.681764  1.872736  1.738017  0.439148  1.012881 -0.934613 -1.007427 -0.390359  0.329949  0.486906  

[5 rows x 100 columns]

Concat, noting the use of axis=1 as this is a column-wise concat.

In [31]: df = pd.concat(l,axis=1,ignore_index=True)

In [32]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 10000 entries, 0 to 9999
dtypes: float64(10000)
memory usage: 763.0 MB

Timings

In [33]: %timeit pd.concat(l,axis=1,ignore_index=True)
1 loops, best of 3: 1.15 s per loop

In [34]: %memit pd.concat(l,axis=1,ignore_index=True)
peak memory: 2390.25 MiB, increment: 651.28 MiB
Jeff
  • Actually axis=0 is the correct choice. In the above comments I posted a link that demonstrates the desired output. I also did some tests with both options to make sure and axis=0 is the right choice. Sorry if that wasn't clear. – sfortney Apr 23 '15 at 17:02
  • For efficient memory utilization you can use the category dtype for all the object-dtype columns. – rajeshcis Aug 21 '17 at 11:19
  • @rajeshcis Not entirely true. If less than 50% of the values are duplicated, I wouldn't convert them to categorical. – Cyzanfar Feb 11 '20 at 03:03
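
For what it's worth, a small sketch of the category conversion suggested above (the column names and values are made up); memory_usage shows whether it actually pays off for a given duplication ratio:

    import pandas as pd

    # made-up frame: a low-cardinality string column next to a float column
    df = pd.DataFrame({'side': ['BEST_BID', 'BEST_ASK'] * 50000,
                       'price': [1117.1, 1120.0] * 50000})

    print(df.memory_usage(deep=True))           # the object column dominates
    df['side'] = df['side'].astype('category')  # few distinct values -> big saving
    print(df.memory_usage(deep=True))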