Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Spark programming model to Python.

26160 questions
244
votes
18 answers

How to change dataframe column names in pyspark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with: df.columns = new_column_name_list. However, the same doesn't work in…
Shubhanshu Mishra
  • 4,850
  • 4
  • 18
  • 23
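
For the renaming question above, a minimal sketch of two common approaches (the SparkSession, sample data, and column names are illustrative, not taken from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["old_id", "old_label"])

    # Rename every column at once by passing the new names to toDF() ...
    renamed_all = df.toDF("id", "label")

    # ... or rename a single column in place
    renamed_one = df.withColumnRenamed("old_id", "id")
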
192
votes
2 answers

Spark performance for Scala vs Python

I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in the Python version for obvious reasons. With that assumption, I thought I'd learn & write the Scala version of some very…
Mrityunjay
  • 1,951
  • 3
  • 12
  • 6
159
votes
15 answers

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully. However, I cannot for the life of me figure out how to stop all…
horatio1701d
  • 6,909
  • 14
  • 36
  • 66
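
A short sketch of the programmatic route, assuming a Spark version that exposes setLogLevel on the SparkContext; a persistent change would instead go through conf/log4j.properties:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Raise the log level on the underlying SparkContext so INFO messages are suppressed
    spark.sparkContext.setLogLevel("WARN")   # or "ERROR" to quiet things further
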
157
votes
3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir
  • 6,460
  • 11
  • 44
  • 68
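
A minimal sketch of the usual fix, wrapping the constant in lit() so withColumn receives a Column expression (the dataframe and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()
    dt = spark.createDataFrame([(1,), (2,)], ["id"])

    # withColumn needs a Column expression, not a plain Python value;
    # lit() wraps the constant so the same value is attached to every row
    dt2 = dt.withColumn("new_column", lit(10))
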
155
votes
10 answers

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success: type(randomed_hours) # => list # Create in Python and transform to RDD new_col = pd.DataFrame(randomed_hours,…
Boris
  • 1,735
  • 2
  • 8
  • 9
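
A small sketch of how columns are typically added, as expressions over existing columns or generated values rather than plain Python lists (the dataframe and column names are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, rand

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2.0), (2, 4.0)], ["id", "hours"])

    # Columns are normally added as expressions over existing columns ...
    df2 = df.withColumn("hours_doubled", col("hours") * 2)

    # ... or as generated values; a plain Python list cannot be attached directly
    df3 = df.withColumn("random_hours", rand() * 8)
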
123
votes
10 answers

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value: df.select('dt_mvmt').distinct().collect() [Row(dt_mvmt=u'2016-03-27'), Row(dt_mvmt=u'2016-03-28'), Row(dt_mvmt=u'2016-03-29'), Row(dt_mvmt=None), …
Ivan
  • 16,448
  • 25
  • 85
  • 133
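
A minimal sketch of the null-aware filters, using isNull()/isNotNull() rather than equality comparisons with None (sample data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2016-03-27",), (None,)], ["dt_mvmt"])

    # Keep only the rows where dt_mvmt is not null ...
    with_dates = df.filter(df.dt_mvmt.isNotNull())

    # ... or only the rows where it is null
    missing = df.filter(df.dt_mvmt.isNull())
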
119
votes
19 answers

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736) I have Spark installed properly on my…
Glenn Strycker
  • 4,376
  • 6
  • 27
  • 48
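
One common workaround, sketched under the assumption that the third-party findspark package is installed and that /path/to/spark is a placeholder for the local Spark directory; exporting SPARK_HOME and extending PYTHONPATH by hand achieves the same thing:

    # Point Python at Spark's libraries before importing pyspark
    import findspark
    findspark.init("/path/to/spark")   # or set SPARK_HOME and call findspark.init()

    import pyspark
    sc = pyspark.SparkContext(appName="shell-test")
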
118
votes
10 answers

Convert spark DataFrame column to python list

I work on a dataframe with two columns, mvv and count. +---+-----+ |mvv|count| +---+-----+ | 1 | 5 | | 2 | 9 | | 3 | 3 | | 4 | 1 | I would like to obtain two lists containing the mvv values and the count values. Something like mvv = [1,2,3,4] count =…
a.moussa
  • 2,185
  • 3
  • 25
  • 43
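
A minimal sketch of collecting one column to the driver and unpacking the Row objects (the dataframe here mirrors the small example in the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (2, 9), (3, 3), (4, 1)], ["mvv", "count"])

    # collect() brings the selected column back to the driver as Row objects,
    # which a list comprehension then unpacks into plain Python values
    mvv = [row.mvv for row in df.select("mvv").collect()]

    # indexing by name avoids the clash with the tuple method Row.count
    count = [row["count"] for row in df.select("count").collect()]
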
118
votes
12 answers

Load CSV file with Spark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael
  • 3,082
  • 3
  • 18
  • 36
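
A short sketch using the built-in CSV reader available in Spark 2.x and later; on the 1.x versions implied by the question, the external spark-csv package or textFile-plus-split is the usual route:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # header/inferSchema are optional conveniences for column names and types
    df = spark.read.csv("file.csv", header=True, inferSchema=True)
    df.show()
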
114
votes
5 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark. Here is how I did it: toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType()) changedTypedf =…
Abhishek Choudhary
  • 7,569
  • 18
  • 63
  • 118
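
A minimal sketch of the cast-based approach, which avoids the UDF entirely (the column name and data are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1.5",), ("2.25",)], ["amount"])

    # cast() accepts either the type object or the string name "double"
    df2 = df.withColumn("amount", col("amount").cast(DoubleType()))
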
113
votes
5 answers

Spark Kill Running Application

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any resources. I did some quick research and people suggested using YARN kill or /bin/spark-class to kill the application. However, I am…
B.Mr.W.
  • 16,522
  • 30
  • 96
  • 156
102
votes
5 answers

Spark DataFrame groupBy and sort in the descending order (pyspark)

I'm using PySpark (Python 2.7.9 / Spark 1.3.1) and have a dataframe GroupObject which I need to filter & sort in descending order. I'm trying to achieve it via this piece of code: group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal
  • 1,413
  • 2
  • 14
  • 19
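
A small sketch of the grouped count filtered and ordered descending (the dataframe and threshold are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("a",), ("b",)], ["key"])

    counts = (df.groupBy("key")
                .count()
                .filter(col("count") >= 1)
                .orderBy(desc("count")))     # or .sort(col("count").desc())
    counts.show()
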
101
votes
10 answers

show distinct column values in pyspark dataframe: python

Please suggest a PySpark dataframe alternative for Pandas df['col'].unique(). I want to list out all the unique values in a PySpark dataframe column. Not the SQL-type way (registerTempTable then SQL query for distinct values). Also I don't need…
Satya
  • 3,707
  • 16
  • 38
  • 63
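
A minimal sketch of the DataFrame-API route, without registering a temp table (the column names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "col"])

    # distinct() on the single-column projection, then unpack the Rows
    unique_values = [row["col"] for row in df.select("col").distinct().collect()]
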
100
votes
9 answers

How to delete columns in pyspark dataframe

>>> a DataFrame[id: bigint, julian_date: string, user_id: bigint] >>> b DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint] >>> a.join(b, a.id==b.id, 'outer') DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524
  • 1,011
  • 2
  • 7
  • 5
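
A short sketch of drop(), which returns a new DataFrame without the named columns; passing several names at once requires a reasonably recent Spark (the sample dataframe is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1, "2016001", 10)], ["id", "julian_date", "user_id"])

    # drop() is non-destructive: it returns a new DataFrame minus these columns
    trimmed = a.drop("julian_date", "user_id")
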
98
votes
5 answers

How to find the size or shape of a DataFrame in PySpark?

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this. In Python I can do data.shape(). Is there a similar function in PySpark? This is my current solution, but I am looking for an element…
Xi Liang
  • 1,049
  • 2
  • 7
  • 5
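
A minimal sketch of the usual shape substitute, pairing a count() job for the rows with the schema for the columns (sample data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # There is no built-in .shape; rows come from count() (a full Spark job)
    # and columns from the schema, which is already known to the driver
    shape = (df.count(), len(df.columns))
    print(shape)   # (2, 2)
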