Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, rows are observations and columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages and libraries (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below cover each language that uses this term; each is aimed at readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Base data.frames have been extended or modified to create new data structures by several R packages. For further reading, see the paragraph on data frames in the CRAN manual Intro to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
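
As a minimal sketch of the two properties described above (dictionary-like columns and alignment on index labels), the following reuses the same example frame. The names df, col, a, b and total are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3})

# Dictionary-like access: a column name maps to a Series.
col = df["y"]
print(type(col).__name__)  # Series
print(list(df.keys()))     # ['x', 'y', 'z']

# Index alignment: arithmetic pairs values by index label, not by position.
a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[2, 1, 0])
total = a + b
print(total.loc[0])  # 31, i.e. a[0] + b[0] = 1 + 30
```

Because the labels rather than the positions are matched, the value at label 0 combines the first element of a with the last element of b.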

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, boolean, etc.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and column are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

95558 questions
14
votes
2 answers

Rendering a pandas dataframe as HTML with same styling as Jupyter Notebook

I would like to render a pandas dataframe to HTML in the same way as the Jupyter Notebook does it, i.e. with all the bells and whistles like nice looking styling, column highlighting, and column sorting on click. pandas.to_html outputs just a plain…
ccpizza
14
votes
1 answer

Pandas groupby and pct change not returning expected value

For each Name in the following dataframe I'm trying to find the percentage change from one Time to the next of the Amount column: Code to create the dataframe: import pandas as pd df = pd.DataFrame({'Name': ['Ali', 'Ali', 'Ali', 'Cala', 'Cala',…
willk
14
votes
3 answers

Marking the entire group if condition is true for a single row

I have a dataframe which has Dates and public holidays Date WeekNum Public_Holiday 1/1/2015 1 1 2/1/2015 1 0 3/1/2015 1 0 4/1/2015 1 0 5/1/2015 1 0 6/1/2015 1 0 7/1/2015 1 0 8/1/2015 2 0 9/1/2015 2 …
Moses Soleman
14
votes
4 answers

Compute co-occurrence matrix by counting values in cells

I have a dataframe like this df = pd.DataFrame({'a' : [1,1,0,0], 'b': [0,1,1,0], 'c': [0,0,1,1]}) I want to get a b c a 2 1 0 b 1 2 1 c 0 1 2 where a,b,c are column names, and I get the values counting '1' in all columns when the filter is '1'…
Edward
14
votes
2 answers

Can't replace 0 to nan in Python using Pandas

I have a dataframe with only 1 column. I want to replace all '0' with np.nan but I can't achieve that. The dataframe is called area. I…
Ilia Chigogidze
14
votes
3 answers

Check if pandas dataframe is subset of other dataframe

I have two Python Pandas dataframes A, B, with the same columns (obviously with different data). I want to check A is a subset of B, that is, all rows of A are contained in B. Any idea how to do it?
Paul
14
votes
5 answers

Convert pandas.DataFrame to list of dictionaries in Python

I have a dictionary which is converted from a dataframe as below : a = d.to_json(orient='index') Dictionary : {"0":{"yr":2017,"PKID":"58306, 57011","Subject":"ABC","ID":"T001"},"1":{"yr":2018,"PKID":"1234,54321","Subject":"XYZ","ID":"T002"}} What…
Shankar Pandey
14
votes
3 answers

how to understand closed and label arguments in pandas resample method?

Based on the pandas documentation from here: Docs And the examples: >>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01…
mingchau
14
votes
6 answers

Get Column and Row Index for Highest Value in Dataframe Pandas

I'd like to know if there's a way to find the location (column and row index) of the highest value in a dataframe. So if for example my dataframe looks like this: A B C D E 0 100 9 1 12 …
christfan868
14
votes
1 answer

PANDAS split dataframe to multiple by unique values rows

I have a DataFrame in Pandas PRICE Name PER CATEGORY STORENAME 0 9.99 MF gram Indica Store1 1 9.99 HY gram Herb Store2 2 9.99 FF gram Herb Store2 What I want to do is split…
Darshan Jadav
14
votes
2 answers

Convert pandas DataFrame column of comma separated strings to one-hot encoded

I have a large dataframe (‘data’) made up of one column. Each row in the column is made of a string and each string is made up of comma separated categories. I wish to one hot encode this data. For example, data = {"mesh": ["A, B, C", "C,B",…
scutnex
14
votes
1 answer

pandas cut: how to convert categorical labels to strings (otherwise cannot export to Excel)?

I use pandas.cut() to discretise a continuous variable into a range, and then group by the result. After a lot of swearing because I couldn't figure out what was wrong, I have learnt that, if I don't supply custom labels to the cut() function, but…
Pythonista anonymous
14
votes
5 answers

Rotating the column name for a Panda DataFrame

I'm trying to make nicely formatted tables from pandas. Some of my column names are far too long. The cells for these columns are large, causing the whole table to be a mess. In my example, is it possible to rotate the column names as they are…
Dan Fiorino
14
votes
1 answer

Why is groupby so fast?

This is a follow up question to this one, where jezrael used pandas.DataFrame.groupby to increment by a factor of some hundreds the speed of a list creation. Specifically, let df be a large dataframe, then index = list(set(df.index)) list_df =…
14
votes
5 answers

use length function in substring in spark

I am trying to use the length function inside a substring function in a DataFrame but it gives error val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1)) below is the error error: type mismatch; found :…
satish