Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It provides a programming abstraction called DataFrames, can act as a distributed SQL query engine, can retrieve data from sources such as Hive and Parquet, and can run SQL queries over existing RDDs and Datasets.


17592 questions
100 votes • 9 answers

How to delete columns in pyspark dataframe

>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524 • 1,011 • 2 • 7 • 5
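A minimal sketch of one common approach, using DataFrame.drop (assuming Spark 2.x / PySpark; the data and column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "2016001", 10), (2, "2016002", 20)],
        ["id", "julian_date", "user_id"],
    )

    # drop() returns a new DataFrame without the named columns
    trimmed = df.drop("julian_date", "user_id")
    trimmed.printSchema()   # only `id` remains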
98 votes • 10 answers

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine.…
SH Y. • 1,549 • 2 • 15 • 21
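A minimal sketch of one way to do this (assuming Spark 2.x / PySpark; the column name is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letters"])

    # collect the single column to the driver and unpack each Row
    letters = [row["letters"] for row in df.select("letters").collect()]
    # or equivalently via the underlying RDD:
    # letters = df.select("letters").rdd.flatMap(lambda x: x).collect()
    print(letters)   # ['a', 'b', 'c']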
95 votes • 6 answers

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name: for( i <- 0 to origCols.length - 1) { df.withColumnRenamed( df.columns(i),…
Sam • 1,117 • 2 • 10 • 13
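The question targets Scala, but a minimal PySpark sketch of the same idea (toDF with a complete list of new names) might look like this; the renaming rule is purely illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["col_a", "col_b", "col_c"])

    # toDF() with a full list of names renames every column in one pass
    new_names = [c.upper() for c in df.columns]   # illustrative transformation
    renamed = df.toDF(*new_names)
    print(renamed.columns)   # ['COL_A', 'COL_B', 'COL_C']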
90 votes • 11 answers

Spark Dataframe distinguish columns with duplicated name

As far as I know, in a Spark Dataframe multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0,…
resec • 1,493 • 2 • 10 • 21
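A minimal sketch of one common workaround, aliasing each side of the join so the duplicated names can be qualified (assuming Spark 2.x / PySpark; data and names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(107831, 1.0)], ["a", "f"])
    df2 = spark.createDataFrame([(107831, 2.0)], ["a", "f"])

    # alias each DataFrame, then refer to columns through the alias
    joined = df1.alias("l").join(df2.alias("r"), col("l.a") == col("r.a"))
    joined.select(col("l.f").alias("f_left"), col("r.f").alias("f_right")).show()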
89 votes • 11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the dataframe directly to Hive.
Gourav • 1,065 • 1 • 9 • 12
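A minimal sketch of one way this is commonly done (assuming Spark 2.x with Hive support available; the table name is hypothetical):

    from pyspark.sql import SparkSession

    # enableHiveSupport() is needed so tables land in the Hive metastore
    spark = (SparkSession.builder
             .appName("save-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # saveAsTable writes the data and registers the table in the metastore
    df.write.mode("overwrite").saveAsTable("my_table")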
88 votes • 6 answers

Convert pyspark string to date format

I have a pyspark dataframe with a date string column in the format MM-dd-yyyy, and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a string of nulls. Can…
Jenks • 1,420 • 2 • 17 • 25
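A minimal sketch using to_date with an explicit format string (the two-argument form is available from Spark 2.2; the column name follows the question):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("05-21-2017",)], ["STRING_COLUMN"])

    # without a format Spark assumes yyyy-MM-dd, which is why the result is null
    df.select(to_date(df.STRING_COLUMN, "MM-dd-yyyy").alias("new_date")).show()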
86 votes • 2 answers

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the number of tasks in the second stage is always 200.
Edison • 865 • 1 • 7 • 7
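A minimal sketch showing where each setting applies: spark.sql.shuffle.partitions controls DataFrame/SQL shuffles, while spark.default.parallelism applies to RDD operations (values below are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.shuffle.partitions", "50")   # DataFrame/SQL shuffles
             .config("spark.default.parallelism", "50")      # RDD ops such as reduceByKey
             .config("spark.sql.adaptive.enabled", "false")  # keep the count exact for this illustration
             .getOrCreate())

    df = spark.createDataFrame([(i % 10, i) for i in range(100)], ["k", "v"])
    # the groupBy triggers a shuffle, so the result has 50 partitions, not the default 200
    print(df.groupBy("k").count().rdd.getNumPartitions())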
86 votes • 5 answers

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns.…
PyRsquared • 5,027 • 5 • 36 • 63
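The question targets Spark 1.3.1, where CSV output needed the separate spark-csv package; a minimal sketch for Spark 2.x, where CSV support is built in (the output path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    table = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # write produces a directory of part files; coalesce(1) forces a single CSV part
    (table.coalesce(1)
          .write.mode("overwrite")
          .option("header", "true")
          .csv("/tmp/table_csv"))
    # for small results, table.toPandas().to_csv("table.csv") is another option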
85 votes • 11 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 863 • 1 • 6 • 6
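A minimal sketch of one way to mirror that SQL with the DataFrame API, using aliases (assuming Spark 2.x / PySpark; data and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "payload"])
    df2 = spark.createDataFrame([(1, "other1")], ["id", "other"])

    # "a.*" keeps every column from df1; "b.other" adds the one extra column from df2
    result = (df1.alias("a")
              .join(df2.alias("b"), col("a.id") == col("b.id"))
              .select("a.*", "b.other"))
    result.show()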
83 votes • 13 answers

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command: df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4') where df is a dataframe having the incremental data to be…
yatin • 833 • 1 • 7 • 7
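A minimal sketch of the dynamic partition overwrite mode available from Spark 2.3 (the output path, data, and partition column are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # only the partitions present in df are replaced, not the whole table path
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    df = spark.createDataFrame([(1, "2024-01-01")], ["value", "col4"])
    df.write.mode("overwrite").partitionBy("col4").orc("/tmp/orc_base_path")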
83 votes • 3 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
lilloraffa • 1,207 • 2 • 15 • 22
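A minimal sketch of one common pattern, building the aggregation list programmatically and unpacking it into agg() (assuming Spark 2.x / PySpark; data and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sum as spark_sum

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

    # one alias-ed aggregate expression per column, instead of writing each one out
    numeric_cols = ["col2", "col3"]
    aggs = [spark_sum(c).alias(c) for c in numeric_cols]
    df.groupBy("col1").agg(*aggs).show()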
83 votes • 5 answers

Updating a dataframe column in spark

Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be df.ix[x,y] = new_value Edit: Consolidating what…
Luke • 5,052 • 9 • 39 • 68
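A minimal sketch of the usual idiom: DataFrames are immutable, so an "update" is expressed as withColumn plus a conditional expression that returns a new DataFrame (assuming Spark 2.x / PySpark; data and condition are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "y"])

    # replace y only where the condition holds; all other rows keep their value
    updated = df.withColumn("y", when(col("id") == 1, 99).otherwise(col("y")))
    updated.show()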
82 votes • 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon • 1,828 • 2 • 17 • 21
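A minimal sketch of one common approach, aggregating with max and pulling the single value back to the driver (assuming Spark 2.x / PySpark; the data mirrors the question's example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import max as spark_max

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

    # agg returns a one-row DataFrame; first()[0] extracts the scalar
    max_a = df.agg(spark_max("A")).first()[0]
    print(max_a)   # 3.0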
80 votes • 6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
bbarker • 7,931 • 5 • 30 • 44
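The question itself targets JUnit; a minimal PySpark analogue using a pytest fixture is sketched below (the test framework choice and names are assumptions, not the asker's setup):

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # one local SparkSession shared across the whole test session
        session = (SparkSession.builder
                   .master("local[2]")
                   .appName("unit-tests")
                   .getOrCreate())
        yield session
        session.stop()

    def test_word_count(spark):
        df = spark.createDataFrame([("a",), ("a",), ("b",)], ["word"])
        counts = {r["word"]: r["count"] for r in df.groupBy("word").count().collect()}
        assert counts == {"a": 2, "b": 1}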
79 votes • 5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
kecso • 2,155 • 2 • 16 • 27
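A minimal sketch of the usual workaround: go through the DataFrame's underlying RDD, which does expose the partition count (assuming PySpark; the data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 1000)

    # no DataFrame-level method in Spark 1.6, but the underlying RDD has one
    print(df.rdd.getNumPartitions())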