Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables and are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

A data frame is a tabular data structure. Usually, it contains data where rows are observations and columns are variables of various types. While data frame or dataframe is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), table is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame; and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). base data.frames have been extended or modified to create new data structures by several R packages, including and . For further reading, see the paragraph on Data frames in the CRAN manual Intro to R


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, boolean, etc.

When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

95558 questions
2756
votes
27 answers

How to iterate over rows in a DataFrame in Pandas

I have a DataFrame from Pandas: import pandas as pd inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}] df = pd.DataFrame(inp) print df Output: c1 c2 0 10 100 1 11 110 2 12 120 Now I want to iterate over the rows of this…
Roman
  • 97,757
  • 149
  • 317
  • 426
2539
votes
11 answers

How to select rows from a DataFrame based on column values

How can I select rows from a DataFrame based on values in some column in Pandas? In SQL, I would use: SELECT * FROM table WHERE colume_name = some_value I tried to look at Pandas' documentation, but I did not immediately find the answer.
szli
  • 28,045
  • 8
  • 26
  • 37
2216
votes
29 answers

Renaming columns in Pandas

I have a DataFrame using Pandas and column labels that I need to edit to replace the original column labels. I'd like to change the column names in a DataFrame A where the original column names are: ['$a', '$b', '$c', '$d', '$e'] to ['a', 'b', 'c',…
user1504276
  • 22,263
  • 3
  • 12
  • 7
1647
votes
17 answers

Delete column from pandas DataFrame

When deleting a column in a DataFrame I use: del df['column_name'] And this works great. Why can't I use the following? del df.column_name Since it is possible to access the column/Series as df.column_name, I expected this to work.
John
  • 32,659
  • 27
  • 74
  • 102
1396
votes
19 answers

How to sort a dataframe by multiple column(s)

I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending): dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), levels = c("Low",…
Christopher DuBois
  • 38,442
  • 23
  • 68
  • 91
1377
votes
21 answers

Selecting multiple columns in a Pandas dataframe

I have data in different columns, but I don't know how to extract it to save it in another variable. index a b c 1 2 3 4 2 3 4 5 How do I select 'a', 'b' and save it in to df1? I tried df1 = df['a':'b'] df1 = df.ix[:,…
user1234440
  • 18,511
  • 17
  • 51
  • 88
1358
votes
13 answers

How to join (merge) data frames (inner, outer, left, right)

Given two data frames: df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3))) df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1))) df1 # CustomerId Product # 1 Toaster # …
Dan Goldstein
  • 21,713
  • 17
  • 34
  • 41
1275
votes
15 answers

How do I get the row count of a Pandas DataFrame?

I'm trying to get the number of rows of dataframe df with Pandas, and here is my code. Method 1: total_rows = df.count print total_rows + 1 Method 2: total_rows = df['First_columnn_label'].count print total_rows + 1 Both the code snippets give me…
yemu
  • 18,591
  • 8
  • 25
  • 29
1155
votes
20 answers

Get list from pandas DataFrame column headers

I want to get a list of the column headers from a pandas DataFrame. The DataFrame will come from user input so I won't know how many columns there will be or what they will be called. For example, if I'm given a DataFrame like this: >>>…
natsuki_2002
  • 19,933
  • 18
  • 42
  • 49
1133
votes
28 answers

Adding new column to existing DataFrame in Python pandas

I have the following indexed DataFrame with named columns and rows not- continuous numbers: a b c d 2 0.671399 0.101208 -0.181532 0.241273 3 0.446172 -0.243316 0.051767 1.577318 5 0.614758 0.075793 -0.451460…
tomasz74
  • 13,747
  • 10
  • 32
  • 49
1117
votes
38 answers

How to change the order of DataFrame columns?

I have the following DataFrame (df): import numpy as np import pandas as pd df = pd.DataFrame(np.random.rand(10, 5)) I add more column(s) by assignment: df['mean'] = df.mean(1) How can I move the column mean to the front, i.e. set it as first…
Timmie
  • 11,359
  • 3
  • 12
  • 7
1116
votes
30 answers

Create pandas Dataframe by appending one row at a time

I understand that pandas is designed to load fully populated DataFrame but I need to create an empty DataFrame then add rows, one by one. What is the best way to do this ? I successfully created an empty DataFrame with : res =…
PhE
  • 12,544
  • 3
  • 18
  • 18
1037
votes
10 answers

Change column type in pandas

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example: a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']] df = pd.DataFrame(a) What is the best way to convert the columns…
user1642513
983
votes
12 answers

How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose EPS column is not NaN: >>> df STK_ID EPS cash STK_ID RPT_Date 601166 20111231 601166 NaN NaN 600036 20111231 600036 NaN 12 600016 20111231 600016 …
bigbug
  • 40,984
  • 35
  • 71
  • 92
941
votes
20 answers

Drop data frame columns by name

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like: df$x <- NULL But I was hoping to do this with fewer commands. Also, I know that I could drop columns using…
Btibert3
  • 34,187
  • 40
  • 119
  • 164
1
2 3
99 100