Questions tagged [pandas]

Pandas is a Python library for data manipulation and analysis, e.g. dataframes, multidimensional time series and cross-sectional datasets commonly found in statistics, experimental science results, econometrics, or finance. Pandas is one of the main data science libraries in Python.

pandas is a Python library for panel data manipulation and analysis (the name derives from PANel DAta), i.e. multidimensional time series and cross-sectional data sets commonly found in statistics, experimental science results, econometrics, or finance. pandas is implemented primarily using NumPy and Cython, and it is intended to integrate easily with NumPy-based scientific libraries such as statsmodels.

Main Features:

  • Data structures for one- and two-dimensional labeled data sets (Series and DataFrames, respectively). Some of their main features include:
  • Automatic data alignment and interpolation
  • Handling missing observations in calculations
  • Convenient slicing and reshaping ("reindexing") functions
  • Categorical data types
  • 'Group by' aggregation and transformation functionality
  • Tools for merging/joining together data sets
  • Simple matplotlib integration for plotting and graphing
  • Multi-indexing, which gives indices a structure that can represent an arbitrary number of dimensions
  • Date tools: objects for expressing date offsets or generating date ranges; some functionality similar to scikits.timeseries. Dates can be aligned to a specific time zone and converted or compared at will
  • Statistical models: convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series / cross-sectional regressions. These are intended as a starting point for implementing further models
  • Intelligent Cython offloading; complex computations are performed rapidly due to these optimizations.
  • Static and moving statistical tools: mean, standard deviation, correlation, covariance
  • Rich user documentation, built with Sphinx
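
A few of the features listed above can be sketched in a handful of lines. This is only a minimal illustration, not a tour of the API: the data and the variable names (`a`, `b`, `df`, `labels`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels; "x" has no partner in b, so the result is NaN there.
total = a + b

# Reductions skip missing values by default.
total_sum = total.sum()

# 'Group by' aggregation on a small DataFrame.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
means = df.groupby("key")["value"].mean()

# Merging two data sets on a shared column.
labels = pd.DataFrame({"key": ["a", "b"], "label": ["first", "second"]})
merged = df.merge(labels, on="key", how="left")

print(total)
print(means)
print(merged)
```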

Asking Questions:

  • Before asking a question, make sure you have gone through the "10 Minutes to pandas" introduction. It covers the basic functionality of pandas.
  • See this question on asking good questions: How to make good reproducible pandas examples
  • Please provide the version of pandas, NumPy, and platform details (if appropriate) in your questions
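
One way to collect the version details mentioned above is sketched here. The `report` dictionary is just a hypothetical container for this example; `pd.show_versions()` prints a fuller report, including optional dependencies.

```python
import platform

import numpy as np
import pandas as pd

# Gather the details that are usually worth pasting into a question.
report = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "python": platform.python_version(),
    "platform": platform.platform(),
}

for name, value in report.items():
    print(f"{name}: {value}")

# For a more complete report, call pd.show_versions() instead.
```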

202712 questions
844
votes
7 answers

Convert list of dictionaries to a pandas DataFrame

I have a list of dictionaries like this: [{'points': 50, 'time': '5:00', 'year': 2010}, {'points': 25, 'time': '6:00', 'month': "february"}, {'points':90, 'time': '9:00', 'month': 'january'}, {'points_h1':20, 'month': 'june'}] And I want to turn…
appleLover
  • 11,323
  • 8
  • 30
  • 46
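
For the question above, one common approach (a sketch, not necessarily the accepted answer) is to pass the list straight to the `DataFrame` constructor. The list is copied from the excerpt; the variable names `records` and `df` are assumptions for this example. Keys missing from a given dictionary become NaN in that row.

```python
import pandas as pd

# The list of dictionaries from the question excerpt.
records = [
    {"points": 50, "time": "5:00", "year": 2010},
    {"points": 25, "time": "6:00", "month": "february"},
    {"points": 90, "time": "9:00", "month": "january"},
    {"points_h1": 20, "month": "june"},
]

# Each dictionary becomes a row; the columns are the union of all keys.
df = pd.DataFrame(records)
print(df)
```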
840
votes
13 answers

Pretty-print an entire Pandas Series / DataFrame

I work with Series and DataFrames on the terminal a lot. The default __repr__ for a Series returns a reduced sample, with some head and tail values, but the rest missing. Is there a builtin way to pretty-print the entire Series / DataFrame? …
Dun Peal
  • 12,539
  • 11
  • 27
  • 40
817
votes
8 answers

Writing a pandas DataFrame to CSV file

I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using: df.to_csv('out.csv') And getting the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in…
user7289
  • 25,989
  • 27
  • 64
  • 86
766
votes
20 answers

How do I expand the output display to see more columns of a pandas DataFrame?

Is there a way to widen the display of output in either interactive or script-execution mode? Specifically, I am using the describe() function on a pandas DataFrame. When the DataFrame is 5 columns (labels) wide, I get the descriptive statistics…
beets
  • 8,051
  • 4
  • 14
  • 11
738
votes
5 answers

How are iloc and loc different?

Can someone explain how these two methods of slicing are different? I've seen the docs, and I've seen these answers, but I still find myself unable to understand how the three are different. To me, they seem interchangeable in large part, because…
AZhao
  • 11,271
  • 6
  • 26
  • 48
691
votes
13 answers

Deleting DataFrame row in Pandas based on column value

I have the following DataFrame: daysago line_race rating rw wrating line_date 2007-03-31 62 11 56 1.000000 56.000000 2007-03-10 83 11 …
TravisVOX
  • 15,002
  • 13
  • 31
  • 36
645
votes
19 answers

Combine two columns of text in pandas dataframe

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter= q2 into 2000q2. Can anyone help with that?
user2866103
  • 6,877
  • 6
  • 13
  • 13
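
A minimal sketch of one way to build the `period` column described above, assuming `Year` holds integers and `quarter` holds strings (the excerpt does not state the dtypes, and the sample rows are made up):

```python
import pandas as pd

# Small stand-in for the 20 x 4000 DataFrame in the question.
df = pd.DataFrame({"Year": [2000, 2001], "quarter": ["q2", "q3"]})

# Cast Year to string so that "+" concatenates element-wise instead of adding.
df["period"] = df["Year"].astype(str) + df["quarter"]
print(df)
```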
610
votes
6 answers

Creating an empty Pandas DataFrame, then filling it?

I'm starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I'd like to initialize the…
Matthias Kauer
  • 7,007
  • 5
  • 15
  • 18
599
votes
22 answers

Set value for particular cell in pandas DataFrame using index

I've created a Pandas DataFrame df = DataFrame(index=['A','B','C'], columns=['x','y']) and got this x y A NaN NaN B NaN NaN C NaN NaN Then I want to assign value to particular cell, for example for row 'C' and column 'x'. I've…
Mitkp
  • 6,087
  • 3
  • 11
  • 7
598
votes
14 answers

Select by partial string from a pandas DataFrame

I have a DataFrame with 4 columns of which 2 contain string values. I was wondering if there was a way to select rows based on a partial string match against a particular column? In other words, a function or lambda function that would do something…
euforia
  • 6,065
  • 3
  • 12
  • 5
590
votes
28 answers

How to count the NaN values in a column in pandas DataFrame

I want to find the number of NaN in each column of my data so that I can drop a column if it has fewer NaN than some threshold. I looked but wasn't able to find any function for this. value_counts is too slow for me because most of the values are…
user3799307
  • 5,909
  • 3
  • 9
  • 3
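
One widely used way to count NaN values per column is `isna()` followed by `sum()`. The sketch below uses made-up data; `nan_counts` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with some missing values.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [np.nan, 2.0, 3.0]})

# isna() yields a boolean mask; summing it counts the True entries per column.
nan_counts = df.isna().sum()
print(nan_counts)
```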
586
votes
8 answers

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

How can I achieve the equivalents of SQL's IN and NOT IN? I have a list with the required values. Here's the scenario: df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']}) countries_to_keep = ['UK', 'China'] #…
LondonRob
  • 53,478
  • 30
  • 110
  • 152
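
For the scenario in the question above, `Series.isin` is the usual counterpart of SQL's IN, and negating the mask with `~` gives NOT IN. A minimal sketch using the data from the excerpt (`kept` and `dropped` are names chosen for this example):

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "UK", "Germany", "China"]})
countries_to_keep = ["UK", "China"]

# IN: keep rows whose country appears in the list.
mask = df["country"].isin(countries_to_keep)
kept = df[mask]

# NOT IN: invert the boolean mask.
dropped = df[~mask]

print(kept)
print(dropped)
```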
583
votes
8 answers

How to convert index of a pandas dataframe into a column

This seems rather obvious, but I can't seem to figure out how to convert an index of data frame to a column? For example: df= gi ptt_loc 0 384444683 593 1 384444684 594 2 384444686 596 To, df= index1 …
msakya
  • 6,911
  • 5
  • 20
  • 29
580
votes
10 answers

Shuffle DataFrame rows

I have the following DataFrame: Col1 Col2 Col3 Type 0 1 2 3 1 1 4 5 6 1 ... 20 7 8 9 2 21 10 11 12 2 ... 45 13 14 15 3 46 16 17 18 3 ... The DataFrame…
JNevens
  • 8,412
  • 7
  • 35
  • 67
571
votes
8 answers

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

I have a data frame df and I use several columns from it to groupby: df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean() In the above way I almost get the table (data frame) that I need. What is missing is an additional column that…
Roman
  • 97,757
  • 149
  • 317
  • 426