Questions tagged [apache-spark-2.0]
Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].
443 questions
57 votes, 5 answers
What are the various join types in Spark?
I looked at the docs and it says the following join types are supported:
Type of join to perform. Default inner. Must be one of: inner, cross,
outer, full, full_outer, left, left_outer, right, right_outer,
left_semi, left_anti.
I looked at…
pathikrit (29,060 reputation)
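A minimal sketch of how the join-type string from that list is passed to DataFrame.join, assuming a local SparkSession and two toy DataFrames (left, right and the id column are illustrative names, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("join-types").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y"), (4, "z")).toDF("id", "r")

left.join(right, Seq("id"), "inner").show()        // only ids 1 and 3
left.join(right, Seq("id"), "left_outer").show()   // keeps ids 1, 2, 3; nulls where id 2 has no match
left.join(right, Seq("id"), "full_outer").show()   // keeps ids 1, 2, 3, 4
left.join(right, Seq("id"), "left_semi").show()    // left columns only, for rows with a match
left.join(right, Seq("id"), "left_anti").show()    // left rows with no match (id 2)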
45 votes, 4 answers
Spark parquet partitioning: Large number of files
I am trying to leverage Spark partitioning. I was trying to do something like
data.write.partitionBy("key").parquet("/location")
The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I try to read from…
Avishek Bhattacharya (5,052 reputation)
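A sketch of one commonly used mitigation, not necessarily the accepted answer: repartitioning by the partition column before partitionBy means each output directory is written by a single task, so it holds one file instead of many. The column name, path and data are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("partition-files").getOrCreate()
import spark.implicits._

val data = Seq((1, "a"), (1, "b"), (2, "c"), (2, "d")).toDF("key", "value")

data
  .repartition(col("key"))     // hash-partitions by key, so each key value lands in one task
  .write
  .mode("overwrite")
  .partitionBy("key")
  .parquet("/tmp/partition-demo")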
37 votes, 3 answers
Reading csv files with quoted fields containing embedded commas
I am reading a csv file in PySpark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file has quoted fields with embedded commas in them which should not be treated as delimiters. How can I handle this in PySpark?…
femibyte (2,585 reputation)
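A sketch of the reader options that usually matter here, written in Scala although the question is PySpark (the same option names exist on the Python reader); the file path is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("quoted-csv").getOrCreate()

val dfRaw = spark.read
  .option("header", "true")
  .option("quote", "\"")      // fields wrapped in double quotes may contain commas
  .option("escape", "\"")     // treat a doubled quote inside a quoted field as an escaped quote
  .csv("/path/to/file.csv")   // illustrative path

dfRaw.show(truncate = false)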
32 votes, 3 answers
Spark 2.0 Dataset vs DataFrame
Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers:
What is the difference between
df.select("foo")
df.select($"foo")
Do I understand correctly…
Georg Heiler (13,862 reputation)
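A sketch illustrating the two select variants from the question: both are untyped and return a Dataset[Row] (a DataFrame); $"foo" just builds a Column through the implicits, which additionally allows expressions and typed projections. The data is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("select-variants").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("foo", "bar")

df.select("foo").show()                     // column referenced by name (String)
df.select($"foo").show()                    // column referenced as a Column object
df.select(($"bar" + 1).as("bar1")).show()   // Columns compose into expressions
df.select($"foo".as[String]).show()         // typed projection: Dataset[String]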
31 votes, 6 answers
How to create SparkSession from existing SparkContext
I have a Spark application which uses the new Spark 2.0 API with SparkSession.
I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize the SparkSession…
Stefan Repcek (2,285 reputation)
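A sketch of one way this is commonly done: SparkSession.builder.getOrCreate() reuses an already-running SparkContext rather than creating a second one, and sc.getConf carries its configuration over. The conf values are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setMaster("local[*]").setAppName("legacy-app")
val sc   = new SparkContext(conf)      // the pre-existing context owned by the other application

val spark = SparkSession.builder
  .config(sc.getConf)                  // reuse its configuration
  .getOrCreate()                       // picks up the running SparkContext instead of starting a new one

assert(spark.sparkContext eq sc)       // same underlying context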
16 votes, 2 answers
Spark off-heap memory config and Tungsten
I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory.
What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
Georg Heiler (13,862 reputation)
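A sketch of how the two settings are supplied; the 2 GB figure is illustrative, and the unified memory manager only uses off-heap execution/storage memory when both are set:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", 2L * 1024 * 1024 * 1024)  // size in bytes
  .getOrCreate()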
15 votes, 0 answers
Spark executors crash due to netty memory leak
When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors, 5 cores and 20 GB RAM per executor, the executors crash with the following log:
ERROR ResourceLeakDetector: LEAK:…
Elad Eldor (743 reputation)
15 votes, 3 answers
Dynamically bind variable/parameter in Spark SQL?
How do I bind a variable in Apache Spark SQL? For example:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729 (151 reputation)
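Spark 2.x's sql() call takes no JDBC-style bind parameters, so a common workaround is Scala string interpolation or skipping the SQL text entirely; a sketch, where the src table and col1 column come from the question and the value is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("bind-demo")
  .enableHiveSupport()                 // mirrors the HiveContext used in the question
  .getOrCreate()
import spark.implicits._

val val1 = 42
spark.sql(s"SELECT * FROM src WHERE col1 = $val1").collect().foreach(println)

// The same filter expressed on the DataFrame API, which sidesteps quoting concerns:
spark.table("src").where($"col1" === val1).show()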
14 votes, 5 answers
Timeout exception in Apache Spark during program execution
I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop.
The code exits with the following…
Yasir (615 reputation)
13 votes, 6 answers
Spark fails to start in local mode when disconnected [Possible bug in handling IPv6 in Spark?]
The problem is the same as described in "Error when starting spark-shell local on Mac", but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname".
So when I am not connected to the internet,…
Aliostad (76,981 reputation)
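A sketch of the workaround that is often suggested for this situation: pin the driver to the loopback address so Spark does not have to resolve the machine's (currently absent) network hostname; exporting SPARK_LOCAL_IP=127.0.0.1 before launching has a similar effect. Whether this is the accepted fix for the linked question is an assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("offline-local")
  .config("spark.driver.host", "127.0.0.1")   // avoid hostname resolution while offline
  .getOrCreate()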
12 votes, 1 answer
Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?
SparkSession
.builder
.master("local[*]")
.config("spark.sql.warehouse.dir", "C:/tmp/spark")
.config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
.appName("my-test")
.getOrCreate
.readStream
…
Martin Brisiak (2,957 reputation)
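A sketch around the error in the title: a streaming Dataset cannot be cached or collected directly, it has to be materialized through writeStream ... start(). The rate source used below arrived in later 2.x releases (a socket or file source plays the same role in 2.0), and the sink, path and names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("stream-demo").getOrCreate()

val stream = spark.readStream
  .format("rate")              // built-in test source emitting (timestamp, value) rows
  .load()

// stream.cache()              // would fail: "Queries with streaming sources must be executed with writeStream.start()"

val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/spark/spark-checkpoint")  // illustrative path
  .start()

query.awaitTermination()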
12 votes, 2 answers
Apache Spark vs Apache Spark 2
What improvements does Apache Spark 2 bring compared to Apache Spark?
From an architecture perspective
From an application point of view
or more
YoungHobbit (12,384 reputation)
11 votes, 2 answers
How to convert RDD of dense vector into DataFrame in pyspark?
I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
Hardik Gupta (4,214 reputation)
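A sketch of the usual shape of the fix, in Scala rather than PySpark: a bare DenseVector is not a row, so each vector is wrapped in a one-field tuple/row before building the DataFrame. Values and names are illustrative:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("vec-to-df").getOrCreate()

val frequencyDenseVectors = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 1.0),
  Vectors.dense(1.0, 1.0, 0.0)
))

val df = spark.createDataFrame(frequencyDenseVectors.map(Tuple1.apply)).toDF("features")
df.printSchema()              // features: vector
df.show(truncate = false)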
10 votes, 1 answer
Pass system property to spark-submit and read file from classpath or custom path
I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing.
The issue is that Spark tries very hard not to see the logback.xml settings on its classpath. I have…
Atais (9,017 reputation)
10 votes, 1 answer
Avoid starting HiveThriftServer2 with created context programmatically
We are trying to use ThriftServer to query data from Spark temp tables, in Spark 2.0.0.
First, we have created a SparkSession with Hive support enabled.
Currently, we start ThriftServer with the sqlContext like…
VladoDemcak (3,708 reputation)
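A sketch of the programmatic route, assuming the spark-hive-thriftserver dependency is on the classpath; the port and view name are illustrative. HiveThriftServer2.startWithContext attaches the server to the already-running session, so the session's temporary views become queryable over JDBC:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder
  .appName("thrift-from-app")
  .enableHiveSupport()
  .getOrCreate()

spark.range(10).createOrReplaceTempView("demo_view")          // something to query over JDBC
spark.sqlContext.setConf("hive.server2.thrift.port", "10001") // illustrative port
HiveThriftServer2.startWithContext(spark.sqlContext)          // start the Thrift server inside this JVM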