Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

443 questions
57 votes · 5 answers

What are the various join types in Spark?

I looked at the docs, which say the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit · 29,060
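
For illustration, a minimal Scala sketch of how those join-type strings are used; the `people` and `dept` frames are made up here, and only the third argument to join changes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
    import spark.implicits._

    val people = Seq((1, "alice"), (2, "bob")).toDF("dept_id", "name")
    val dept   = Seq((1, "engineering")).toDF("dept_id", "dept_name")

    people.join(dept, Seq("dept_id"), "inner").show()      // matching rows only
    people.join(dept, Seq("dept_id"), "left_outer").show() // all of people, nulls for misses
    people.join(dept, Seq("dept_id"), "left_anti").show()  // rows of people with no match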
45 votes · 4 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
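
A commonly suggested remedy, sketched below with the question's own names (`data`, "key", "/location"): repartition by the partition column first, so all rows for a given key land in one task and each key directory gets a single file.

    import org.apache.spark.sql.functions.col

    // Assumes `data` is the DataFrame from the question. Without the repartition,
    // every task holding a given key writes its own file into that key's directory.
    data
      .repartition(col("key"))
      .write
      .partitionBy("key")
      .parquet("/location")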
37 votes · 3 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in PySpark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas in them, which should not be treated as delimiters. How can I handle this in PySpark?…
femibyte · 2,585
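
A sketch of one likely fix, written in Scala like most snippets on this page (the PySpark reader accepts the same options); whether it applies depends on how the file escapes embedded quotes:

    // csv_path stands in for the path from the question.
    val df_raw = spark.read
      .option("header", "true")
      .option("quote", "\"")   // fields containing commas are wrapped in double quotes
      .option("escape", "\"")  // embedded quotes are doubled ("") rather than backslashed
      .csv(csv_path)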
32 votes · 3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
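
A brief sketch of the distinction; the Record case class is invented for illustration:

    import spark.implicits._   // provides the $"..." Column interpolator

    df.select("foo")    // untyped: resolves the name at analysis time, returns Dataset[Row]
    df.select($"foo")   // same result, but $"foo" is a Column, so expressions like
                        // $"foo".desc or $"foo" + 1 can be built from it

    // A Dataset additionally carries a compile-time element type:
    case class Record(foo: String)
    val ds = df.as[Record]     // Dataset[Record]; fields checked against Record at compile time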
31 votes · 6 answers

How to create SparkSession from existing SparkContext

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize the SparkSession…
Stefan Repcek · 2,285
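
A minimal sketch of the usual approach: since SparkSession.builder's getOrCreate reuses an already-running SparkContext, it is enough to hand over the existing context's configuration.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // `sc` is the SparkContext received from the other application.
    def sessionFrom(sc: SparkContext): SparkSession =
      SparkSession.builder().config(sc.getConf).getOrCreate()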
16 votes · 2 answers

spark off heap memory config and tungsten

I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
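
For reference, the two settings travel together; a sketch (the 2 GB figure is arbitrary):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")     // off-heap use is opt-in
      .set("spark.memory.offHeap.size", "2147483648")  // bytes (2 GB here), required when enabled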
15 votes · 0 answers

Spark executors crash due to netty memory leak

When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors, 5 cores and 20GB RAM per executor, the executors crash with the following log: ERROR ResourceLeakDetector: LEAK:…
Elad Eldor · 743
15 votes · 3 answers

dynamically bind variable/parameter in Spark SQL?

How do you bind a variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729 · 151
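
Spark 2.x offers no JDBC-style bind-parameter API for spark.sql, so the usual workaround is plain Scala string interpolation; a sketch (VAL1's value is illustrative):

    val VAL1 = "some_value"
    spark.sql(s"SELECT * FROM src WHERE col1 = '$VAL1'").collect().foreach(println)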
14 votes · 5 answers

Timeout Exception in Apache-Spark during program Execution

I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop. The code exits with the following…
Yasir · 615
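
The excerpt cuts off before the actual stack trace, so this is only a guess at a frequent culprit: broadcast joins timing out after the default 300 seconds under repeated invocations. A hedged sketch of the usual first knob to turn:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("repeated-calls")                     // illustrative name
      .config("spark.sql.broadcastTimeout", "3600")  // seconds; default is 300
      .getOrCreate()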
13 votes · 6 answers

Spark fails to start in local mode when disconnected [Possible bug in handling IPv6 in Spark??]

The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". So when I am not connected to the internet,…
Aliostad · 76,981
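
A commonly suggested workaround, not a confirmed fix for this exact report: pin the driver's host so Spark does not try to resolve the machine's network name while offline (similar in spirit to exporting SPARK_LOCAL_IP=127.0.0.1).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.driver.host", "127.0.0.1")  // skip hostname resolution when offline
      .getOrCreate()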
12 votes · 1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

    SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "C:/tmp/spark")
      .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
      .appName("my-test")
      .getOrCreate
      .readStream …
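
The short answer the title hints at: cache() asks Spark to plan and materialize the Dataset eagerly, and the analyzer rejects any such execution of a plan containing streaming sources; only writeStream.start() may run it. A sketch, with a socket source chosen purely for brevity:

    val stream = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // stream.cache()  // would throw AnalysisException: "Queries with streaming
    //                 // sources must be executed with writeStream.start()"

    val query = stream.writeStream
      .format("console")
      .start()          // the supported way to execute a streaming plan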
12 votes · 2 answers

Apache Spark vs Apache Spark 2

What are the improvements Apache Spark 2 brings compared to Apache Spark 1.x? From an architecture perspective, from an application point of view, or more.
YoungHobbit · 12,384
11 votes · 2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
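
The question is PySpark, but keeping with the Scala used elsewhere on this page, the analogous conversion wraps each vector in a one-field tuple and relies on the built-in Vector UDT (the toy vectors below are illustrative):

    import org.apache.spark.ml.linalg.Vectors

    val rdd = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 1.0),
      Vectors.dense(1.0, 1.0, 1.0)
    ))
    import spark.implicits._
    val df = rdd.map(Tuple1.apply).toDF("features")  // one "features" column of vectors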
10 votes · 1 answer

Pass system property to spark-submit and read file from classpath or custom path

I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing. The issue is that Spark tries very hard not to see logback.xml settings in its classpath. I have…
Atais · 9,017
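
For reference, a hedged sketch of the standard knobs (paths illustrative): pass the logback location as a JVM system property on both driver and executors, and ship the file itself with spark-submit's --files so executors can resolve it.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
           "-Dlogback.configurationFile=logback.xml")
      .set("spark.executor.extraJavaOptions",
           "-Dlogback.configurationFile=logback.xml")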
10 votes · 1 answer

Avoid starting HiveThriftServer2 with created context programmatically

We are trying to use ThriftServer to query data from Spark temp tables in Spark 2.0.0. First, we created a SparkSession with Hive support enabled. Currently, we start ThriftServer with the sqlContext like…
VladoDemcak · 3,708
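
The pattern the question describes, sketched below (the temp view is illustrative): start the Thrift server against an existing session's SQLContext so JDBC clients can see its temporary tables.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .appName("thrift-from-context")   // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TEMPORARY VIEW demo AS SELECT 1 AS id")
    HiveThriftServer2.startWithContext(spark.sqlContext)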