Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].

443 questions
57 votes · 5 answers

What are the various join types in Spark?

I looked at the docs, which say the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit · 29,060
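
For illustration, a minimal Scala sketch of how those join-type strings are used; the `people` and `dept` frames are made up here, and only the third argument to join changes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
    import spark.implicits._

    val people = Seq((1, "alice"), (2, "bob")).toDF("dept_id", "name")
    val dept   = Seq((1, "engineering")).toDF("dept_id", "dept_name")

    people.join(dept, Seq("dept_id"), "inner").show()      // matching rows only
    people.join(dept, Seq("dept_id"), "left_outer").show() // all of people, nulls for misses
    people.join(dept, Seq("dept_id"), "left_anti").show()  // rows of people with no match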
45 votes · 4 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
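
A commonly suggested remedy, sketched below with the question's own names (`data`, "key", "/location"): repartition by the partition column first, so all rows for a given key land in one task and each key directory gets a single file.

    import org.apache.spark.sql.functions.col

    // Assumes `data` is the DataFrame from the question. Without the repartition,
    // every task holding a given key writes its own file into that key's directory.
    data
      .repartition(col("key"))
      .write
      .partitionBy("key")
      .parquet("/location")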
37 votes · 3 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in PySpark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas in them, which should not be treated as delimiters. How can I handle this in PySpark?…
femibyte · 2,585
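
A sketch of one likely fix, written in Scala like most snippets on this page (the PySpark reader accepts the same options); whether it applies depends on how the file escapes embedded quotes:

    // csv_path stands in for the path from the question.
    val df_raw = spark.read
      .option("header", "true")
      .option("quote", "\"")   // fields containing commas are wrapped in double quotes
      .option("escape", "\"")  // embedded quotes are doubled ("") rather than backslashed
      .csv(csv_path)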
32 votes · 3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers: What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
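
A brief sketch of the distinction; the Record case class is invented for illustration:

    import spark.implicits._   // provides the $"..." Column interpolator

    df.select("foo")    // untyped: resolves the name at analysis time, returns Dataset[Row]
    df.select($"foo")   // same result, but $"foo" is a Column, so expressions like
                        // $"foo".desc or $"foo" + 1 can be built from it

    // A Dataset additionally carries a compile-time element type:
    case class Record(foo: String)
    val ds = df.as[Record]     // Dataset[Record]; fields checked against Record at compile time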
31 votes · 6 answers

How to create SparkSession from existing SparkContext

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize the SparkSession…
Stefan Repcek · 2,285
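
A minimal sketch of the usual approach: since SparkSession.builder's getOrCreate reuses an already-running SparkContext, it is enough to hand over the existing context's configuration.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // `sc` is the SparkContext received from the other application.
    def sessionFrom(sc: SparkContext): SparkSession =
      SparkSession.builder().config(sc.getConf).getOrCreate()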
16 votes · 2 answers

spark off heap memory config and tungsten

I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for…
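
For reference, the two settings travel together; a sketch (the 2 GB figure is arbitrary):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")     // off-heap use is opt-in
      .set("spark.memory.offHeap.size", "2147483648")  // bytes (2 GB here), required when enabled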
15 votes · 0 answers

Spark executors crash due to netty memory leak

When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, with 10 executors, 5 cores and 20GB RAM per executor, the executors crash with the following log: ERROR ResourceLeakDetector: LEAK:…
Elad Eldor · 743
15 votes · 3 answers

dynamically bind variable/parameter in Spark SQL?

How do you bind a variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729 · 151
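
Spark 2.x offers no JDBC-style bind-parameter API for spark.sql, so the usual workaround is plain Scala string interpolation; a sketch (VAL1's value is illustrative):

    val VAL1 = "some_value"
    spark.sql(s"SELECT * FROM src WHERE col1 = '$VAL1'").collect().foreach(println)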
14 votes · 5 answers

Timeout Exception in Apache-Spark during program Execution

I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop. The code exits with the following…
Yasir · 615
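
The excerpt cuts off before the actual stack trace, so this is only a guess at a frequent culprit: broadcast joins timing out after the default 300 seconds under repeated invocations. A hedged sketch of the usual first knob to turn:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("repeated-calls")                     // illustrative name
      .config("spark.sql.broadcastTimeout", "3600")  // seconds; default is 300
      .getOrCreate()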
13 votes · 6 answers

Spark fails to start in local mode when disconnected [Possible bug in handling IPv6 in Spark??]

The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". So when I am not connected to the internet,…
Aliostad · 76,981
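
A commonly suggested workaround, not a confirmed fix for this exact report: pin the driver's host so Spark does not try to resolve the machine's network name while offline (similar in spirit to exporting SPARK_LOCAL_IP=127.0.0.1).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.driver.host", "127.0.0.1")  // skip hostname resolution when offline
      .getOrCreate()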
12 votes · 1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

    SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "C:/tmp/spark")
      .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
      .appName("my-test")
      .getOrCreate
      .readStream …
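
The short answer the title hints at: cache() asks Spark to plan and materialize the Dataset eagerly, and the analyzer rejects any such execution of a plan containing streaming sources; only writeStream.start() may run it. A sketch, with a socket source chosen purely for brevity:

    val stream = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // stream.cache()  // would throw AnalysisException: "Queries with streaming
    //                 // sources must be executed with writeStream.start()"

    val query = stream.writeStream
      .format("console")
      .start()          // the supported way to execute a streaming plan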
12 votes · 2 answers

Apache Spark vs Apache Spark 2

What are the improvements Apache Spark 2 brings compared to Apache Spark 1.x? From an architecture perspective, from an application point of view, or more.
YoungHobbit · 12,384
11 votes · 2 answers

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this >>> frequencyDenseVectors.collect() [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0,…
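
The question is PySpark, but keeping with the Scala used elsewhere on this page, the analogous conversion wraps each vector in a one-field tuple and relies on the built-in Vector UDT (the toy vectors below are illustrative):

    import org.apache.spark.ml.linalg.Vectors

    val rdd = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 1.0),
      Vectors.dense(1.0, 1.0, 1.0)
    ))
    import spark.implicits._
    val df = rdd.map(Tuple1.apply).toDF("features")  // one "features" column of vectors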
10 votes · 1 answer

Pass system property to spark-submit and read file from classpath or custom path

I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, there is one last piece missing. The issue is that Spark tries very hard not to see logback.xml settings in its classpath. I have…
Atais · 9,017
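
For reference, a hedged sketch of the standard knobs (paths illustrative): pass the logback location as a JVM system property on both driver and executors, and ship the file itself with spark-submit's --files so executors can resolve it.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
           "-Dlogback.configurationFile=logback.xml")
      .set("spark.executor.extraJavaOptions",
           "-Dlogback.configurationFile=logback.xml")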
10 votes · 1 answer

Avoid starting HiveThriftServer2 with created context programmatically

We are trying to use ThriftServer to query data from Spark temp tables in Spark 2.0.0. First, we created a SparkSession with Hive support enabled. Currently, we start ThriftServer with the sqlContext like…
VladoDemcak · 3,708
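
The pattern the question describes, sketched below (the temp view is illustrative): start the Thrift server against an existing session's SQLContext so JDBC clients can see its temporary tables.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .appName("thrift-from-context")   // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TEMPORARY VIEW demo AS SELECT 1 AS id")
    HiveThriftServer2.startWithContext(spark.sqlContext)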