Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Spark programming model to Python.

26160 questions
244
votes
18 answers

How to change dataframe column names in pyspark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with: df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra
  • 4,850
  • 4
  • 18
  • 23
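
For the renaming question above, a minimal sketch of two common approaches (the SparkSession, sample data, and column names are illustrative, not taken from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["old_id", "old_label"])

    # Rename every column at once by passing the new names to toDF() ...
    renamed_all = df.toDF("id", "label")

    # ... or rename a single column in place
    renamed_one = df.withColumnRenamed("old_id", "id")
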
192
votes
2 answers

Spark performance for Scala vs Python

I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in the Python version for obvious reasons. With that assumption, I thought I'd learn & write the Scala version of some very…
Mrityunjay
  • 1,951
  • 3
  • 12
  • 6
159
votes
15 answers

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully. However, I cannot for the life of me figure out how to stop all…
horatio1701d
  • 6,909
  • 14
  • 36
  • 66
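
A short sketch of the programmatic route, assuming a Spark version that exposes setLogLevel on the SparkContext; a persistent change would instead go through conf/log4j.properties:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Raise the log level on the underlying SparkContext so INFO messages are suppressed
    spark.sparkContext.setLogLevel("WARN")   # or "ERROR" to quiet things further
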
157
votes
3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir
  • 6,460
  • 11
  • 44
  • 68
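
A minimal sketch of the usual fix, wrapping the constant in lit() so withColumn receives a Column expression (the dataframe and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()
    dt = spark.createDataFrame([(1,), (2,)], ["id"])

    # withColumn needs a Column expression, not a plain Python value;
    # lit() wraps the constant so the same value is attached to every row
    dt2 = dt.withColumn("new_column", lit(10))
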
155
votes
10 answers

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success: type(randomed_hours) # => list # Create in Python and transform to RDD new_col = pd.DataFrame(randomed_hours,…
Boris
  • 1,735
  • 2
  • 8
  • 9
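
A small sketch of how columns are typically added, as expressions over existing columns or generated values rather than plain Python lists (the dataframe and column names are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, rand

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (2, 4.0)], ["id", "hours"])

    # Columns are normally added as expressions over existing columns ...
    df2 = df.withColumn("hours_doubled", col("hours") * 2)

    # ... or as generated values; a plain Python list cannot be attached directly
    df3 = df.withColumn("random_hours", rand() * 8)
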
123
votes
10 answers

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value: df.select('dt_mvmt').distinct().collect() [Row(dt_mvmt=u'2016-03-27'), Row(dt_mvmt=u'2016-03-28'), Row(dt_mvmt=u'2016-03-29'), Row(dt_mvmt=None), …
Ivan
  • 16,448
  • 25
  • 85
  • 133
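
A minimal sketch of the null-aware filters, using isNull()/isNotNull() rather than equality comparisons with None (sample data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2016-03-27",), (None,)], ["dt_mvmt"])

    # Keep only the rows where dt_mvmt is not null ...
    with_dates = df.filter(df.dt_mvmt.isNotNull())

    # ... or only the rows where it is null
    missing = df.filter(df.dt_mvmt.isNull())
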
119
votes
19 answers

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736) I have Spark installed properly on my…
Glenn Strycker
  • 4,376
  • 6
  • 27
  • 48
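
One common workaround, sketched under the assumption that the third-party findspark package is installed and that /path/to/spark is a placeholder for the local Spark directory; exporting SPARK_HOME and extending PYTHONPATH by hand achieves the same thing:

    # Point Python at Spark's libraries before importing pyspark
    import findspark
    findspark.init("/path/to/spark")   # or set SPARK_HOME and call findspark.init()

    import pyspark
    sc = pyspark.SparkContext(appName="shell-test")
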
118
votes
10 answers

Convert spark DataFrame column to python list

I work on a dataframe with two columns, mvv and count. +---+-----+ |mvv|count| +---+-----+ | 1 | 5 | | 2 | 9 | | 3 | 3 | | 4 | 1 | I would like to obtain two lists containing the mvv values and the count values. Something like mvv = [1,2,3,4] count =…
a.moussa
  • 2,185
  • 3
  • 25
  • 43
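
A minimal sketch of collecting one column to the driver and unpacking the Row objects (the dataframe here mirrors the small example in the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (2, 9), (3, 3), (4, 1)], ["mvv", "count"])

    # collect() brings the selected column back to the driver as Row objects,
    # which a list comprehension then unpacks into plain Python values
    mvv = [row.mvv for row in df.select("mvv").collect()]

    # indexing by name avoids the clash with the tuple method Row.count
    count = [row["count"] for row in df.select("count").collect()]
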
118
votes
12 answers

Load CSV file with Spark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael
  • 3,082
  • 3
  • 18
  • 36
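
A short sketch using the built-in CSV reader available in Spark 2.x and later; on the 1.x versions implied by the question, the external spark-csv package or textFile-plus-split is the usual route:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # header/inferSchema are optional conveniences for column names and types
    df = spark.read.csv("file.csv", header=True, inferSchema=True)
    df.show()
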
114
votes
5 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark. Here is how I did it: toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType()) changedTypedf =…
Abhishek Choudhary
  • 7,569
  • 18
  • 63
  • 118
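
A minimal sketch of the cast-based approach, which avoids the UDF entirely (the column name and data are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1.5",), ("2.25",)], ["amount"])

    # cast() accepts either the type object or the string name "double"
    df2 = df.withColumn("amount", col("amount").cast(DoubleType()))
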
113
votes
5 answers

Spark Kill Running Application

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources. I did some quick research and people suggested using YARN kill or /bin/spark-class to kill the application. However, I am…
B.Mr.W.
  • 16,522
  • 30
  • 96
  • 156
102
votes
5 answers

Spark DataFrame groupBy and sort in the descending order (pyspark)

I'm using PySpark (Python 2.7.9 / Spark 1.3.1) and have a dataframe GroupObject which I need to filter & sort in descending order. I'm trying to achieve it via this piece of code: group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal
  • 1,413
  • 2
  • 14
  • 19
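
A small sketch of the grouped count filtered and ordered descending (the dataframe and threshold are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("a",), ("b",)], ["key"])

    counts = (df.groupBy("key")
                .count()
                .filter(col("count") >= 1)
                .orderBy(desc("count")))     # or .sort(col("count").desc())
    counts.show()
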
101
votes
10 answers

show distinct column values in pyspark dataframe: python

Please suggest a PySpark dataframe alternative for Pandas df['col'].unique(). I want to list out all the unique values in a PySpark dataframe column. Not the SQL-type way (registerTempTable then SQL query for distinct values). Also I don't need…
Satya
  • 3,707
  • 16
  • 38
  • 63
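
A minimal sketch of the DataFrame-API route, without registering a temp table (the column names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "col"])

    # distinct() on the single-column projection, then unpack the Rows
    unique_values = [row["col"] for row in df.select("col").distinct().collect()]
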
100
votes
9 answers

How to delete columns in pyspark dataframe

>>> a DataFrame[id: bigint, julian_date: string, user_id: bigint] >>> b DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint] >>> a.join(b, a.id==b.id, 'outer') DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524
  • 1,011
  • 2
  • 7
  • 5
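
A short sketch of drop(), which returns a new DataFrame without the named columns; passing several names at once requires a reasonably recent Spark (the sample dataframe is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1, "2016001", 10)], ["id", "julian_date", "user_id"])

    # drop() is non-destructive: it returns a new DataFrame minus these columns
    trimmed = a.drop("julian_date", "user_id")
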
98
votes
5 answers

How to find the size or shape of a DataFrame in PySpark?

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this. In Python I can do data.shape(). Is there a similar function in PySpark? This is my current solution, but I am looking for an element…
Xi Liang
  • 1,049
  • 2
  • 7
  • 5
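
A minimal sketch of the usual shape substitute, pairing a count() job for the rows with the schema for the columns (sample data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # There is no built-in .shape; rows come from count() (a full Spark job)
    # and columns from the schema, which is already known to the driver
    shape = (df.count(), len(df.columns))
    print(shape)   # (2, 2)
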