Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It provides a programming abstraction called DataFrames, can act as a distributed SQL query engine, can retrieve data from sources such as Hive and Parquet, and can run SQL queries over existing RDDs and Datasets.


17592 questions
100 votes • 9 answers

How to delete columns in pyspark dataframe

>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524 • 1,011 • 2 • 7 • 5
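A minimal sketch of one common approach, using DataFrame.drop (assuming Spark 2.x / PySpark; the data and column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "2016001", 10), (2, "2016002", 20)],
        ["id", "julian_date", "user_id"],
    )

    # drop() returns a new DataFrame without the named columns
    trimmed = df.drop("julian_date", "user_id")
    trimmed.printSchema()   # only `id` remains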
98 votes • 10 answers

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine.…
SH Y. • 1,549 • 2 • 15 • 21
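A minimal sketch of one way to do this (assuming Spark 2.x / PySpark; the column name is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letters"])

    # collect the single column to the driver and unpack each Row
    letters = [row["letters"] for row in df.select("letters").collect()]
    # or equivalently via the underlying RDD:
    # letters = df.select("letters").rdd.flatMap(lambda x: x).collect()
    print(letters)   # ['a', 'b', 'c']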
95 votes • 6 answers

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name: for( i <- 0 to origCols.length - 1) { df.withColumnRenamed( df.columns(i),…
Sam • 1,117 • 2 • 10 • 13
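The question targets Scala, but a minimal PySpark sketch of the same idea (toDF with a complete list of new names) might look like this; the renaming rule is purely illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["col_a", "col_b", "col_c"])

    # toDF() with a full list of names renames every column in one pass
    new_names = [c.upper() for c in df.columns]   # illustrative transformation
    renamed = df.toDF(*new_names)
    print(renamed.columns)   # ['COL_A', 'COL_B', 'COL_C']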
90 votes • 11 answers

Spark Dataframe distinguish columns with duplicated name

As far as I know, in a Spark Dataframe multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0,…
resec • 1,493 • 2 • 10 • 21
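A minimal sketch of one common workaround, aliasing each side of the join so the duplicated names can be qualified (assuming Spark 2.x / PySpark; data and names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(107831, 1.0)], ["a", "f"])
    df2 = spark.createDataFrame([(107831, 2.0)], ["a", "f"])

    # alias each DataFrame, then refer to columns through the alias
    joined = df1.alias("l").join(df2.alias("r"), col("l.a") == col("r.a"))
    joined.select(col("l.f").alias("f_left"), col("r.f").alias("f_right")).show()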
89 votes • 11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the dataframe directly to Hive.
Gourav • 1,065 • 1 • 9 • 12
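A minimal sketch of one way this is commonly done (assuming Spark 2.x with Hive support available; the table name is hypothetical):

    from pyspark.sql import SparkSession

    # enableHiveSupport() is needed so tables land in the Hive metastore
    spark = (SparkSession.builder
             .appName("save-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # saveAsTable writes the data and registers the table in the metastore
    df.write.mode("overwrite").saveAsTable("my_table")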
88 votes • 6 answers

Convert pyspark string to date format

I have a pyspark dataframe with a date string column in the format MM-dd-yyyy, and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a string of nulls. Can…
Jenks • 1,420 • 2 • 17 • 25
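A minimal sketch using to_date with an explicit format string (the two-argument form is available from Spark 2.2; the column name follows the question):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("05-21-2017",)], ["STRING_COLUMN"])

    # without a format Spark assumes yyyy-MM-dd, which is why the result is null
    df.select(to_date(df.STRING_COLUMN, "MM-dd-yyyy").alias("new_date")).show()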
86 votes • 2 answers

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the number of tasks in the second stage is always 200.
Edison • 865 • 1 • 7 • 7
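A minimal sketch showing where each setting applies: spark.sql.shuffle.partitions controls DataFrame/SQL shuffles, while spark.default.parallelism applies to RDD operations (values below are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.shuffle.partitions", "50")   # DataFrame/SQL shuffles
             .config("spark.default.parallelism", "50")      # RDD ops such as reduceByKey
             .config("spark.sql.adaptive.enabled", "false")  # keep the count exact for this illustration
             .getOrCreate())

    df = spark.createDataFrame([(i % 10, i) for i in range(100)], ["k", "v"])
    # the groupBy triggers a shuffle, so the result has 50 partitions, not the default 200
    print(df.groupBy("k").count().rdd.getNumPartitions())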
86 votes • 5 answers

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns.…
PyRsquared • 5,027 • 5 • 36 • 63
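The question targets Spark 1.3.1, where CSV output needed the separate spark-csv package; a minimal sketch for Spark 2.x, where CSV support is built in (the output path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    table = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # write produces a directory of part files; coalesce(1) forces a single CSV part
    (table.coalesce(1)
          .write.mode("overwrite")
          .option("header", "true")
          .csv("/tmp/table_csv"))
    # for small results, table.toPandas().to_csv("table.csv") is another option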
85 votes • 11 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 863 • 1 • 6 • 6
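A minimal sketch of one way to mirror that SQL with the DataFrame API, using aliases (assuming Spark 2.x / PySpark; data and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "payload"])
    df2 = spark.createDataFrame([(1, "other1")], ["id", "other"])

    # "a.*" keeps every column from df1; "b.other" adds the one extra column from df2
    result = (df1.alias("a")
              .join(df2.alias("b"), col("a.id") == col("b.id"))
              .select("a.*", "b.other"))
    result.show()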
83 votes • 13 answers

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command: df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4') where df is a dataframe having the incremental data to be…
yatin • 833 • 1 • 7 • 7
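A minimal sketch of the dynamic partition overwrite mode available from Spark 2.3 (the output path, data, and partition column are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # only the partitions present in df are replaced, not the whole table path
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    df = spark.createDataFrame([(1, "2024-01-01")], ["value", "col4"])
    df.write.mode("overwrite").partitionBy("col4").orc("/tmp/orc_base_path")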
83 votes • 3 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
lilloraffa • 1,207 • 2 • 15 • 22
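A minimal sketch of one common pattern, building the aggregation list programmatically and unpacking it into agg() (assuming Spark 2.x / PySpark; data and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sum as spark_sum

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

    # one alias-ed aggregate expression per column, instead of writing each one out
    numeric_cols = ["col2", "col3"]
    aggs = [spark_sum(c).alias(c) for c in numeric_cols]
    df.groupBy("col1").agg(*aggs).show()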
83 votes • 5 answers

Updating a dataframe column in spark

Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be df.ix[x,y] = new_value Edit: Consolidating what…
Luke • 5,052 • 9 • 39 • 68
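A minimal sketch of the usual idiom: DataFrames are immutable, so an "update" is expressed as withColumn plus a conditional expression that returns a new DataFrame (assuming Spark 2.x / PySpark; data and condition are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "y"])

    # replace y only where the condition holds; all other rows keep their value
    updated = df.withColumn("y", when(col("id") == 1, 99).otherwise(col("y")))
    updated.show()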
82 votes • 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon • 1,828 • 2 • 17 • 21
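A minimal sketch of one common approach, aggregating with max and pulling the single value back to the driver (assuming Spark 2.x / PySpark; the data mirrors the question's example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import max as spark_max

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

    # agg returns a one-row DataFrame; first()[0] extracts the scalar
    max_a = df.agg(spark_max("A")).first()[0]
    print(max_a)   # 3.0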
80 votes • 6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
bbarker • 7,931 • 5 • 30 • 44
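The question itself targets JUnit; a minimal PySpark analogue using a pytest fixture is sketched below (the test framework choice and names are assumptions, not the asker's setup):

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # one local SparkSession shared across the whole test session
        session = (SparkSession.builder
                   .master("local[2]")
                   .appName("unit-tests")
                   .getOrCreate())
        yield session
        session.stop()

    def test_word_count(spark):
        df = spark.createDataFrame([("a",), ("a",), ("b",)], ["word"])
        counts = {r["word"]: r["count"] for r in df.groupBy("word").count().collect()}
        assert counts == {"a": 2, "b": 1}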
79 votes • 5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
kecso • 2,155 • 2 • 16 • 27
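A minimal sketch of the usual workaround: go through the DataFrame's underlying RDD, which does expose the partition count (assuming PySpark; the data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 1000)

    # no DataFrame-level method in Spark 1.6, but the underlying RDD has one
    print(df.rdd.getNumPartitions())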