Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Apache Spark, a fast and general-purpose cluster computing system. It provides a programming abstraction called DataFrames, can act as a distributed SQL query engine, and can be used to retrieve data from sources such as Hive and Parquet and to run SQL queries over existing RDDs and Datasets.

17592 questions
298 votes • 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]). Can you convert one to the other?
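
A minimal sketch of the conversions, assuming Spark 2.x and a local SparkSession (the Person record type and the sample data are illustrative, not from the question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("conversions").master("local[*]").getOrCreate()
    import spark.implicits._                  // enables .toDF / .toDS / .as[T]

    case class Person(name: String, age: Int) // hypothetical record type

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bo", 25)))

    val df    = rdd.toDF()       // RDD -> DataFrame (a type alias for Dataset[Row])
    val ds    = rdd.toDS()       // RDD -> Dataset[Person], keeps the static type
    val rows  = df.rdd           // DataFrame -> RDD[Row]
    val typed = df.as[Person]    // DataFrame -> Dataset[Person]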
169 votes • 7 answers

How to select the first row of each group?

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category").agg(sum($"value") as "TotalValue").sort($"Hour".asc, $"TotalValue".desc) The results look…
Rami • 6,898 • 16 • 61 • 98
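
A common answer uses a window function to rank rows within each group. A sketch, assuming a SparkSession named spark and the question's Hour, Category, and value columns:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{row_number, sum}
    import spark.implicits._   // for the $"..." column syntax

    val totals = df.groupBy($"Hour", $"Category")
      .agg(sum($"value").as("TotalValue"))

    // Rank rows within each Hour by TotalValue, then keep only the top row.
    val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)
    val firstPerGroup = totals
      .withColumn("rn", row_number().over(w))
      .where($"rn" === 1)
      .drop("rn")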
164 votes • 23 answers

How can I change column types in Spark SQL's DataFrame?

Suppose I'm doing something like: val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true")) df.printSchema() root |-- year: string (nullable = true) |-- make: string (nullable = true) |-- model: string…
kevinykuo • 4,050 • 4 • 20 • 29
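
The usual fix is an in-place cast with withColumn rather than a UDF; a sketch using the year column from the question's schema:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.IntegerType

    // Re-assign the column with a cast; all other columns pass through unchanged.
    val casted = df.withColumn("year", col("year").cast(IntegerType))
    // The string form is equivalent:
    val casted2 = df.withColumn("year", col("year").cast("int"))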
157 votes • 3 answers

How to add a constant column in a Spark DataFrame?

I want to add a column to a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows: dt.withColumn('new_column',…
Evan Zamir • 6,460 • 11 • 44 • 68
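
The error arises because withColumn expects a Column, not a bare value; wrapping the constant in lit() is the standard remedy. A minimal sketch (the label column name is illustrative):

    import org.apache.spark.sql.functions.lit

    // lit() turns a constant into a Column that repeats on every row.
    val withConst = df.withColumn("new_column", lit(10))
    val withText  = df.withColumn("label", lit("constant"))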
157 votes • 6 answers

How to sort by column in descending order in Spark SQL?

I tried df.orderBy("col1").show(10) but it sorted in ascending order. df.sort("col1").show(10) also sorts in ascending order. I looked on stackoverflow and the answers I found were all outdated or referred to RDDs. I'd like to use the native…
Vedom • 2,707 • 3 • 12 • 15
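
Both orderBy and sort default to ascending order; descending has to be requested explicitly. A sketch of the common forms:

    import org.apache.spark.sql.functions.{col, desc}

    df.orderBy(col("col1").desc).show(10)
    df.orderBy(desc("col1")).show(10)
    df.sort($"col1".desc).show(10)   // $-syntax needs: import spark.implicits._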
155 votes • 10 answers

How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column. I've tried the following without any success: type(randomed_hours) # => list # Create in Python and transform to RDD new_col = pd.DataFrame(randomed_hours,…
Boris • 1,735 • 2 • 8 • 9
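
The usual answer is that a new column must be derived from the DataFrame itself (an expression, a literal, or a join), not from a local list. A sketch in Scala to match the other examples on this page (the seconds column is hypothetical); the PySpark calls carry the same names:

    import org.apache.spark.sql.functions.{lit, rand}
    import spark.implicits._

    val withConst = df.withColumn("hours", lit(0))             // constant value
    val derived   = df.withColumn("hours", $"seconds" / 3600)  // from an existing column
    val generated = df.withColumn("noise", rand())             // generated per row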
155 votes • 14 answers

Spark - load CSV file as DataFrame?

I would like to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name") I have tried: scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv") Error which I…
Donbeo • 14,217 • 30 • 93 • 162
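
On Spark 2.x the CSV source is built in, so the spark-csv package is only needed on 1.x. A sketch assuming a SparkSession named spark:

    val df = spark.read
      .option("header", "true")        // first line holds column names
      .option("inferSchema", "true")   // sample the file to guess column types
      .csv("hdfs:///csv/file/dir/file.csv")

    df.createOrReplaceTempView("table_name")   // successor of registerTempTable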
148 votes • 12 answers

How to convert rdd object to dataframe in spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
user568109 • 43,824 • 15 • 87 • 118
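
An RDD[Row] carries no schema, so one must be supplied on the way back. A sketch with an illustrative two-column schema, assuming a SparkSession named spark:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType,  nullable = true),
      StructField("age",  IntegerType, nullable = true)
    ))

    val rowRdd = df.rdd                               // DataFrame -> RDD[Row]
    // ... arbitrary RDD processing here ...
    val back = spark.createDataFrame(rowRdd, schema)  // RDD[Row] -> DataFrame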
145 votes • 16 answers

Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?
Nipun • 3,423 • 3 • 36 • 67
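
Spark SQL provides concat and concat_ws for exactly this. A sketch with hypothetical first and last columns:

    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    // concat joins columns and literals directly:
    val joined = df.withColumn("full", concat(col("first"), lit(" "), col("last")))
    // concat_ws inserts a separator and skips NULL inputs:
    val joined2 = df.withColumn("full", concat_ws("-", col("first"), col("last")))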
137 votes • 5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I want to define a custom partitioner on DataFrames, in Scala, but don't see how to do this. One of the data tables I'm working with contains a list of transactions, by account,…
rake • 2,208 • 3 • 12 • 11
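
DataFrames do not accept a custom Partitioner the way RDDs do, but repartitioning by column (Spark 1.6+) co-locates equal keys, which covers the common use case. A sketch with a hypothetical AccountId column, assuming import spark.implicits._:

    // Hash-partition rows so all rows with the same AccountId share a partition.
    val byAccount = df.repartition($"AccountId")
    // With an explicit partition count:
    val byAccount200 = df.repartition(200, $"AccountId")
    // Range partitioning is also available on Spark 2.3+:
    val ranged = df.repartitionByRange($"AccountId")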
123 votes • 10 answers

Filter Pyspark dataframe column with None value

I'm trying to filter a PySpark dataframe that has None as a row value: df.select('dt_mvmt').distinct().collect() [Row(dt_mvmt=u'2016-03-27'), Row(dt_mvmt=u'2016-03-28'), Row(dt_mvmt=u'2016-03-29'), Row(dt_mvmt=None), …
Ivan • 16,448 • 25 • 85 • 133
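
Python's None is stored as SQL NULL, and NULL never compares equal to anything, so isNull/isNotNull are the right predicates rather than equality. A sketch (Scala syntax, but the PySpark methods share these names):

    import spark.implicits._

    val nulls    = df.where($"dt_mvmt".isNull)      // rows where dt_mvmt is None/NULL
    val nonNulls = df.where($"dt_mvmt".isNotNull)   // rows with a real value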
122 votes • 15 answers

How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that? Thanks. PS: I want to check if it's empty so that I only save the DataFrame if it's not empty
auxdx • 1,833 • 2 • 18 • 23
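
A full count scans every partition; fetching at most one row is enough to answer the question. A sketch:

    // Pull at most one row to the driver instead of counting everything.
    val nonEmpty = df.head(1).nonEmpty
    // Equivalent: df.limit(1).count == 1
    // Spark 2.4+ also has a built-in:
    val empty = df.isEmpty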
114 votes • 5 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a DataFrame with a column of type String. I want to change the column type to Double in PySpark. Here is what I did: toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType()) changedTypedf =…
Abhishek Choudhary • 7,569 • 18 • 63 • 118
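
A plain cast is simpler and cheaper than a UDF for type conversion. A sketch in Scala with a hypothetical amount column (PySpark's col(...).cast("double") is spelled the same):

    import org.apache.spark.sql.functions.col

    // Replace the string column with its double-typed cast in place.
    val changed = df.withColumn("amount", col("amount").cast("double"))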
102 votes • 5 answers

Spark DataFrame groupBy and sort in the descending order (pyspark)

I'm using PySpark (Python 2.7.9 / Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I'm trying to achieve it via this piece of code: group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal • 1,413 • 2 • 14 • 19
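
The chain works once the count column produced by count() is sorted with an explicit descending order. A Scala sketch with a hypothetical grouping column category, assuming import spark.implicits._:

    import org.apache.spark.sql.functions.desc

    val result = df.groupBy($"category")
      .count()                     // adds a "count" column
      .filter($"count" >= 10)
      .orderBy(desc("count"))      // descending, not the default ascending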
101 votes • 9 answers

How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (reading an empty file), but I don't think that's the best practice.
user1735076 • 2,665 • 7 • 17 • 16
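
An empty RDD[Row] combined with an explicit StructType avoids the JSON workaround. A sketch with an illustrative schema, assuming a SparkSession named spark:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("id",   IntegerType, nullable = false),
      StructField("name", StringType,  nullable = true)
    ))

    // Zero rows, but a fully typed schema:
    val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)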