1116

I understand that pandas is designed to load fully populated DataFrame but I need to create an empty DataFrame then add rows, one by one. What is the best way to do this ?

I successfully created an empty DataFrame with :

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with :

res = res.set_value(len(res), 'qty1', 10.0)

It works but seems very odd :-/ (it fails for adding string value)

How can I add a new row to my DataFrame (with different columns type) ?

cs95
  • 274,032
  • 76
  • 480
  • 537
PhE
  • 12,544
  • 3
  • 18
  • 18
  • 77
    Note this is a very inefficient way to build a large DataFrame; new arrays have to be created (copying over the existing data) when you append a row. – Wes McKinney May 23 '12 at 13:46
  • 5
    @WesMcKinney: Thx, that's really good to know. Is it very fast to add *columns* to huge tables? – max Aug 28 '12 at 04:27
  • 4
    If it is too inefficient for you, you may preallocate an additional row and then update it. – user1154664 Apr 19 '13 at 19:54
  • 13
    Hey you... yes, **you**... I see what you're up to... you want to run this inside a loop and iteratively add rows to an empty DataFrame, don't you... well, [don't!](https://stackoverflow.com/a/56746204/4909087) – cs95 Jul 13 '20 at 12:52
  • 2
    I might understand this can be in general wrong but, what about real-time processing? So say I have some data that comes in every second and I have a thread that just wants to fill a dataframe and have another even-based thread that goes and look at the dataframe? I find this use case valid and where that solution is applicable to – Giuseppe Nov 20 '20 at 17:24
  • 1
    @Giuseppe then pandas DataFrame is not the right data structure, since it's designed for analysis, not for real-time processing. – justhalf Dec 14 '20 at 02:49
  • Thanks @justhalf, I guess it's because of the way it is implemented it won't be suitable for such kind of scenarios. Digging around I found this https://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1 , I am not sure if it's the right solution but I will surely give it a shot – Giuseppe Jan 20 '21 at 13:17

30 Answers30

760

You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>>     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6
fred
  • 8,377
  • 1
  • 22
  • 31
  • 32
    Consider adding the index to preallocate memory (see my answer) – FooBar Jul 23 '14 at 14:22
  • 55
    `.loc` is referencing the index column, so if you're working with a pre-existing DataFrame with an index that isn't a continous sequence of integers starting with 0 (as in your example), `.loc` will overwrite existing rows, or insert rows, or create gaps in your index. A more robust (but not fool-proof) approach for appending an existing nonzero-length dataframe would be: `df.loc[df.index.max() + 1] = [randint(...` or prepopulating the index as @FooBar suggested. – hobs Sep 25 '15 at 23:21
  • 4
    @hobs `df.index.max()` is `nan` when the DataFrame is empty. – flow2k Apr 24 '19 at 01:30
  • 2
    @flow2k good catch! Only solution I can think of is a try accept (on the first row insertion only) with a pd.DataFrame() constructor call. Do you know any better ways? – hobs Apr 24 '19 at 21:31
  • 11
    @hobs One solution I thought of is using the ternary operator: `df.loc[0 if pd.isnull(df.index.max()) else df.index.max() + 1]` – flow2k Apr 25 '19 at 21:17
  • `df.loc[i, "column_name"] = new_value` will also fill in a specific field, if you just need to pad out the last value of a column so that another operation will carry through (eg. for plotting a `.cumsum()`) – StackG Aug 11 '20 at 08:20
  • df.loc[len(df.index)] = row – alex Mar 17 '21 at 08:45
636

In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:

  1. Create a list of dictionaries in which each dictionary corresponds to an input data row.
  2. Create a data frame from this list.

I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.

rows_list = []
for row in input_rows:

        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..) 

        rows_list.append(dict1)

df = pd.DataFrame(rows_list)               
SergiyKolesnikov
  • 5,453
  • 1
  • 18
  • 38
ShikharDua
  • 7,441
  • 1
  • 18
  • 21
  • 62
    I've moved to doing this as well for any situation where I can't get all the data up front. The speed difference is astonishing. – fantabolous Aug 13 '14 at 12:19
  • 63
    Copying from pandas docs: `It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.` (http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects) – thikonom Dec 25 '15 at 22:01
  • 7
    This works great! Except when I created the data frame, the columns names were all in the wrong order... – user5359531 Aug 09 '16 at 21:36
  • 5
    @user5359531 You can use ordered dict in that case – ShikharDua Aug 10 '16 at 20:31
  • 24
    @user5359531 You can manually specify the columns and the order will be preserved. pd.DataFrame(rows_list, columns=['C1', 'C2','C3']) will do the trick – Marcello Grechi Lins Jan 27 '17 at 22:26
  • 1
    @ShikharDua Can you explain how to structure the data, i.e. when you say # get input row in dictionary format # key = col_name what do you mean? my use case: I am fetching rows from a MySQL table and each `row` object is coming as a tuple containing 10 strings – avg Jan 11 '18 at 10:31
  • @avg I am guessing each tuple is supposed to be a row in pandas table. In that case, it can be an element if the row_list instead of dict1. And once all data is on form of [tuple1, tuple2, tuple 3, .....] then you can create a final dataset – ShikharDua Jan 29 '19 at 00:29
  • @avg, it's probably better to use the [read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) method in that case. Then you don't have to worry about parsing it on your own. – Eric Ed Lohmar Jun 17 '19 at 16:10
  • What is the performance difference between a list of dicts (1 dict per row) and 1 dict with each value a list? (1 list per col) – qwr Dec 30 '19 at 01:16
  • 1
    I tried this in my code using Tuple and it is even faster than Dict. 3 times faster for 160 rows and 4 columns in my case. – Sumit Pokhrel Mar 03 '20 at 22:17
  • 1
    @thikonom Isn't the append function used here a Python list (not a pandas object), which has [O(1) amortized append() in CPython](https://wiki.python.org/moin/TimeComplexity)? This answer looks good to me. – Josiah Yoder Jun 25 '20 at 17:19
  • 1
    Why are we using dict.update() rather than subscripting as mentioned [here](https://stackoverflow.com/questions/15456158/python-dict-update-vs-subscript-to-add-a-single-key-value-pair)? – kevin_theinfinityfund Jul 14 '20 at 17:02
  • @the775 you can use whatever you want. – ShikharDua Jul 15 '20 at 00:54
  • @ShikharDua appreciate it. I mentioned it because the article I [tagged above](https://stackoverflow.com/questions/15456158/python-dict-update-vs-subscript-to-add-a-single-key-value-pair) mentions the runtime of the native `dict1[key] = value` is faster than the .update() method, and many of the answers are concerned with speed. – kevin_theinfinityfund Jul 16 '20 at 16:04
  • is this the way to do it for an existing dataframe by transposing the dataframe: https://stackoverflow.com/a/29815523/13865853 – mLstudent33 Nov 06 '20 at 02:47
  • actually isn't it faster doing `df_dict = df.to_dict('records')` ? This way you get dictionaries with column names, not an int for column number as keys. – mLstudent33 Nov 06 '20 at 03:00
316

You could use pandas.concat() or DataFrame.append(). For details and examples, see Merge, join, and concatenate.

Michael WS
  • 2,053
  • 3
  • 20
  • 40
NPE
  • 438,426
  • 93
  • 887
  • 970
  • 7
    Hi, so what is the answer for the methods using append() or concat(). I have the same problem, but still trying to figuring it out. – notilas Aug 20 '14 at 22:52
  • 136
    This is the right answer, but it isn't a very *good* answer (almost link only). – jwg May 18 '16 at 14:34
  • 5
    I think @fred's answer is more correct. IIUC the problem with this answer is that it needlessly copies the entire DataFrame every time a row is appended. Using the `.loc` mechanism that can be avoided, especially if you're careful. – Ken Williams Mar 16 '17 at 16:03
  • 7
    But if you want to use `DataFrame.append()`, you have to make sure your row data is also a DataFrame in the first place, not a list. – StayFoolish Sep 08 '17 at 12:46
294

It's been a long time, but I faced the same problem too. And found here a lot of interesting answers. So I was confused what method to use.

In the case of adding a lot of rows to dataframe I interested in speed performance. So I tried 4 most popular methods and checked their speed.

UPDATED IN 2019 using new versions of packages. Also updated after @FooBar comment

SPEED PERFORMANCE

  1. Using .append (NPE's answer)
  2. Using .loc (fred's answer)
  3. Using .loc with preallocating (FooBar's answer)
  4. Using dict and create DataFrame in the end (ShikharDua's answer)

Results (in secs):

|------------|-------------|-------------|-------------|
|  Approach  |  1000 rows  |  5000 rows  | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append    |    0.69     |    3.39     |    6.78     |
|------------|-------------|-------------|-------------|
| .loc w/o   |    0.74     |    3.90     |    8.35     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| .loc with  |    0.24     |    2.58     |    8.70     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
|  dict      |    0.012    |   0.046     |   0.084     |
|------------|-------------|-------------|-------------|

Also thanks to @krassowski for useful comment - I updated the code.

So I use addition through the dictionary for myself.


Code:

import pandas as pd
import numpy as np
import time

del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S. I believe, my realization isn't perfect, and maybe there is some optimization.

Mikhail_Sam
  • 7,481
  • 10
  • 37
  • 71
  • 4
    The use of `df2.index.max()` for `.loc` needlessly increases computational complexity. Simple `df2.loc[i] = ...` would do. For me it reduced the time from 10s to 8.64s – krassowski Jan 23 '19 at 20:44
  • Please remove my name from the list, since you're not following my approach in your test: You're not preallocating the memory by providing an index of suitable size. – FooBar Jul 29 '19 at 18:27
  • @FooBar Hi! I'm glad you as the author saw my answer :) you are right, I missed this important point. I prefer to add one more row for my result table as your approach show the different result! – Mikhail_Sam Jul 30 '19 at 08:17
  • @Mikhail_Sam How would you use pivot-table to write it on an excel file using the fastest method, dict ? – FabioSpaghetti Aug 11 '19 at 12:36
  • Why use a list of dicts instead of just hstacking the numpy into a 2D array, and converting and adding column names at the end? – rayzinnz Aug 12 '19 at 08:33
  • @FabioSpaghetti in the dict-methods we get df at the ens. So as in the all other methods. So there is no any problem to create [pivot-table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) on it. Here, for example, you can find example of reading/writing to Excel: [pivot-tables on Excel](https://pbpython.com/pandas-pivot-report.html) – Mikhail_Sam Aug 16 '19 at 08:53
  • 1
    @Mikhail_Sam For the last, dict approach, what's the rationale behind using two loops, `for i in range (0,5):` and `for i in range( 1,numOfRows-4):`? – flow2k Sep 29 '19 at 09:03
  • 1
    Just wanted to throw out another comment as to why the Dict to Pandas DataFrame is a better way. In my experimentation with a dataset that has multiple different data types in the table, using the Pandas append methods destroy the typing, whereas using a Dict, and only creating the DataFrame from it ONCE, seems to keep the original datatypes intact. – trumpetlicks Dec 04 '19 at 14:23
  • I have a different method which should be comparable to dictionary method but could be marginally faster that you should try timing. https://stackoverflow.com/a/59524806/3163618 – qwr Dec 30 '19 at 01:45
  • I tested it and it is slightly faster: list-of-dicts: 1000:0.034, 5000:0.155, 10000:0.342. dict-of-lists: 1000:0.032, 5000:0.149, 10000:0.295 – qwr Dec 30 '19 at 01:56
  • I tried this in my code using Tuple and it is even faster than Dict. Just wondering that Tuple are immutable. So how are we able to append them ? – Sumit Pokhrel Mar 03 '20 at 22:10
  • @rayzinnz I have not tried, but I suppose it should not be faster - numpy arrays are contiguous pieces of memory, so numpy recreates the array every time you use htack. While the list just add a pointer to a new variable in another memory location. – Mikhail_Sam Jun 05 '20 at 07:30
  • I think the `dict` approach should be renamed the `list.append` approach (and `append` into `df.append`), it is faster because it relies on a list `row_list.append()` and then creates a dataframe from that list instead of appending data directly on the dataframe with `df1.append()`. Both methods use dictionaries, the point is using `list()` vs `pd.DataFrame()` when populating data row by row. – EricLavault Oct 18 '20 at 13:42
146

NEVER grow a DataFrame!

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.

  1. It is always cheaper/faster to append to a list and create a DataFrame in one go.
  2. Lists take up less memory and are a much lighter data structure to work with, append, and remove.
  3. dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
  4. An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

These options are horrible

  1. append or concat inside a loop

    append and concat aren't inherently bad in isolation. The problem starts when you iteratively call them inside a loop - this results in quadratic memory usage.

    # Creates empty DataFrame and appends
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for a, b, c in some_function_that_yields_data():
        df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)  
        # This is equally bad:
        # df = pd.concat(
        #       [df, pd.Series({'A': i, 'B': b, 'C': c})], 
        #       ignore_index=True)
    
  2. Empty DataFrame of NaNs

    Never create a DataFrame of NaNs as the columns are initialized with object (slow, un-vectorizable dtype).

    # Creates DataFrame of NaNs and overwrites values.
    df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
    for a, b, c in some_function_that_yields_data():
        df.loc[len(df)] = [a, b, c]
    

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

enter image description here

Benchmarking code for reference.


It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.

cs95
  • 274,032
  • 76
  • 480
  • 537
  • 3
    Fair enough. Are there any solution in case you need (or would like) a dataframe, but all your samples really do come one after the other? (Typically online learning or active learning) – user1657853 Sep 08 '20 at 15:52
  • This doesn't factor in the case where one needs the dataframe after every append(). In that case, the dataframe gets copied anyway, so the df.loc method is faster – Dev Aggarwal Sep 17 '20 at 10:45
  • @DevAggarwal incorrect, loc also creates a copy each time. Please see the graph timings in my answer. Append and loc_append are equally bad. I've also shared my code and process so you're free to convince yourself. – cs95 Sep 17 '20 at 16:36
  • Aplogies should have been clearer. Kindly create the dataframe from intermediate list inside the for loop, here -- https://gist.github.com/Coldsp33d/ea080f580ab3a3b75c2f40c201d50164#file-create_df-py-L13 – Dev Aggarwal Sep 17 '20 at 17:10
  • 1
    good answer @cs95 ! – adir abargil Jan 05 '21 at 12:44
  • Has anyone benchmarked what @DevAggarwal suggests? I often get to that case – Kuzeko Jan 19 '21 at 17:42
  • @Kuzeko the graph in my answer shows they are about the same in terms of performance, equally bad and so I wouldn't particularly recommend one over the other. If your case requires the Dataframe at each iteration, perhaps you should consider rethinking the structure of your code. Maybe use map or apply. – cs95 Jan 19 '21 at 17:45
  • Thanks a lot for the answer. I made a small benchmark adapting your code to my data and it still seems better the list append method, even though advantage is reduced. Unfortunately I need to pass it to third party libraries, so I need to create that dataframe. – Kuzeko Jan 20 '21 at 15:11
122

If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )

# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]: 
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

Speed comparison

In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop

And - as from the comments - with a size of 6000, the speed difference becomes even larger:

Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313ms vs 2.29s

FooBar
  • 13,456
  • 10
  • 65
  • 140
  • 3
    Great answer. This should be the norm so that row space doesn't have to allocated incrementally. – ely Oct 09 '14 at 18:32
  • 8
    Increasing the size of the array(12) and the number of rows(500) makes the speed difference more striking: 313ms vs 2.29s – Tickon Apr 02 '15 at 10:55
90
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row
Lydia
  • 2,027
  • 13
  • 10
  • 3
    This! I've been searching for quite a while, and this is the first post that really shows how to assign particular values to a row! Bonus question: Which is the syntax for column-name/value pairs? I guess it must be something using a dict, but I can't seem to get it right. – jhin Mar 09 '16 at 00:00
  • 7
    this is not efficient as it actually copies the entire DataFrame when you extend it. – waterproof Jul 25 '19 at 16:42
78

You can append a single row as a dictionary using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black
W.P. McNeill
  • 13,777
  • 9
  • 63
  • 94
  • 44
    You might also mention that `f.append()` creates a new object, rather than simply appending to the current object in place, so if you're trying to append to a dataframe in a script, you need to say `f = f.append()` – Blairg23 May 28 '16 at 03:57
  • 2
    is there a way to do this in place? – lol Nov 08 '16 at 03:48
  • @lol no. see https://github.com/pandas-dev/pandas/issues/2801 - the underlying arrays can't be extended so they have to be copied. – waterproof Jul 25 '19 at 16:42
  • I prefer this method because it is very SQL-like (not dependent on indices semantically) and I use it whenever possible. – Gene M Jul 31 '20 at 21:45
75

For efficient appending see How to add an extra row to a pandas dataframe and Setting With Enlargement.

Add rows through loc/ix on non existing key index data. e.g. :

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]: 
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

Or:

In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....: 

In [2]: dfi
Out[2]: 
   A  B
0  0  1
1  2  3
2  4  5

In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [4]: dfi
Out[4]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
In [5]: dfi.loc[3] = 5

In [6]: dfi
Out[6]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
Community
  • 1
  • 1
Nasser Al-Wohaibi
  • 4,102
  • 1
  • 33
  • 26
47

For the sake of Pythonic way, here add my answer:

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN
hkyi
  • 2,696
  • 2
  • 16
  • 9
31

You can also build up a list of lists and convert it to a dataframe -

import pandas as pd

columns = ['i','double','square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

giving

    i   double  square
0   0   0   0
1   1   2   1
2   2   4   4
3   3   6   9
4   4   8   16
5   5   10  25
Brian Burns
  • 14,953
  • 5
  • 69
  • 59
16

Figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Note the caveat with performance as noted in the comments

cs95
  • 274,032
  • 76
  • 480
  • 537
Qinsi
  • 712
  • 8
  • 15
  • 2
    Note that this will copy the entire DataFrame under the hood. The underlying arrays can't be extended so they have to be copied. – waterproof Jul 25 '19 at 16:43
14

This is not an answer to the OP question but a toy example to illustrate the answer of @ShikharDua above which I found very useful.

While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the stats below for more than one taget column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you @ShikharDua !

import pandas as pd 

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict 
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})      
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})                   

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
user3250815
  • 149
  • 1
  • 2
9

You can use generator object to create Dataframe, which will be more memory efficient over the list.

num = 10

# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))

# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )

df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))

To add raw to existing DataFrame you can use append method.

df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400  }])
RockStar
  • 1,136
  • 1
  • 10
  • 29
8

Create a new record(data frame) and add to old_data_frame.
pass list of values and corresponding column names to create a new_record (data_frame)

new_record = pd.DataFrame([[0,'abcd',0,1,123]],columns=['a','b','c','d','e'])

old_data_frame = pd.concat([old_data_frame,new_record])
Jack Daniel
  • 2,091
  • 3
  • 26
  • 42
7

Here is the way to add/append a row in pandas DataFrame

def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1  
    return df.sort_index()

add_row(df, [1,2,3]) 

It can be used to insert/append a row in empty or populated pandas DataFrame

Sociopath
  • 11,667
  • 16
  • 38
  • 61
shivampip
  • 1,306
  • 11
  • 14
7

Instead of a list of dictionaries as in ShikharDua's answer, we can also represent our table as a dictionary of lists, where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.

For c columns and n rows, this uses 1 dictionary and c lists, versus 1 list and n dictionaries. The list of dictionaries method has each dictionary storing all keys and requires creating a new dictionary for every row. Here we only append to lists, which is constant time and theoretically very fast.

# current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}

# adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")

# at the end, construct our DataFrame
df = pd.DataFrame(data)
#   Animal  Color
# 0    cow   blue
# 1  horse    red
# 2  mouse  black
qwr
  • 6,786
  • 3
  • 42
  • 72
4

if you want to add row at the end append it as a list

valuestoappend = [va1,val2,val3]
res = res.append(pd.Series(valuestoappend,index = ['lib', 'qty1', 'qty2']),ignore_index = True)
Shahir Ansari
  • 1,048
  • 12
  • 19
3

Another way to do it (probably not very performant):

# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))

You can also enhance the DataFrame class like this:

import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
qed
  • 19,750
  • 16
  • 99
  • 168
3

All you need is loc[df.shape[0]] or loc[len(df)]


# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False] 

or

df.loc[len(df)] = ['col1Value', 100, 'col3Value', False] 
Giorgos Myrianthous
  • 24,168
  • 10
  • 80
  • 106
3

if you always want to add new row at the end....use this

df.loc[len(df)]=['name5',9,0]
Prajot Kuvalekar
  • 1,232
  • 1
  • 3
  • 12
2
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}

df = pd.DataFrame(initial_data)

df

lib qty1    qty2
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4

val_1 = [10]
val_2 = [14]
val_3 = [20]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

lib qty1    qty2
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
0   10  14  20

You can use for loop to iterate through values or can add arrays of values

val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

lib qty1    qty2
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
0   10  14  20
1   11  15  21
2   12  16  22
3   13  17  43
Harshal Deore
  • 65
  • 1
  • 5
1

Make it simple. By taking list as input which will be appended as row in data-frame:-

import pandas as pd  
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))  
for i in range(5):  
    res_list = list(map(int, input().split()))  
    res = res.append(pd.Series(res_list,index=['lib','qty1','qty2']), ignore_index=True)
Vineet Jain
  • 1,288
  • 4
  • 16
  • 31
0

We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far. But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly, in that that method performs similar to the one using dict. So we should more often take the use of df.values[subscript] = … into consideration. However note that .values takes a zero-based subscript, which may be different from the DataFrame.index.

Armali
  • 14,228
  • 13
  • 47
  • 141
  • 1
    @baxx - One _code example_ is at the _benchmarks_ link (`# .loc with prealloc`), another example is in the question [I have to compare data from each row of a Pandas DataFrame with data from the rest of the rows, is there a way to speed up the computation?](https://stackoverflow.com/questions/57572695/i-have-to-compare-data-from-each-row-of-a-pandas-dataframe-with-data-from-the-re) and its accepted answer. – Armali Feb 07 '20 at 10:23
0

pandas.DataFrame.append

DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)

With ignore_index set to True:

df.append(df2, ignore_index=True)
kamran kausar
  • 2,569
  • 18
  • 14
0

You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index(not numeric). So, I input the data for a new row in a duct() and index in a list.

new_dict = {put input for new row here}
new_list = [put your index here]

new_df = pd.DataFrame(data=new_dict, index=new_list)

df = pd.concat([existing_df, new_df])
hansrajSwapnil
  • 175
  • 2
  • 10
0

If all data in your Dataframe has the same dtype you might use a numpy array. You can write rows directly into the predefined array and convert it to a dataframe at the end. Seems to be even faster than converting a list of dicts.

import pandas as pd
import numpy as np
from string import ascii_uppercase

startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))      
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df5.shape)

   
Gerard
  • 147
  • 1
  • 6
0

If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:

df.loc[len(df)] = new_list

If you want to add a new data frame new_df under data frame df, then you can use:

df.append(new_df)
Mahdi
  • 45
  • 6
-1

before going to add a row, we have to convert the dataframe to dictionary there you can see the keys as columns in dataframe and values of the columns are again stored in the dictionary but there key for every column is the index number in dataframe. That idea make me to write the below code.

df2=df.to_dict()
values=["s_101","hyderabad",10,20,16,13,15,12,12,13,25,26,25,27,"good","bad"] #this is total row that we are going to add
i=0
for x in df.columns:   #here df.columns gives us the main dictionary key
    df2[x][101]=values[i]   #here the 101 is our index number it is also key of sub dictionary
    i+=1
-4

This will take care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan for the first index:

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]
Mark
  • 661
  • 8
  • 23
tomatom
  • 359
  • 3
  • 9