Questions tagged [rdd]

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the foundation of the framework's core module, commonly referred to as "Spark Core".

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute partitions that go missing or are damaged due to node failures.

Distributed, with data residing on multiple nodes in a cluster.

Dataset: a collection of partitioned data holding primitive values or compound values, e.g. tuples or other objects that represent the records of the data you work with.

For more information:

  1. Mastering Apache Spark: RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, by Matei Zaharia et al.
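The "resilient" property above can be made concrete with a toy model in plain Python (an illustration only, not the Spark API): each RDD records its parent and the transformation that produced it, so any lost partition can be rebuilt by replaying the lineage chain from the source data.

```python
# Toy model of RDD lineage (illustration only, not the Spark API).
# Each node records its parent and transformation; data is never
# mutated, so a result can always be rebuilt by replaying the chain.

class ToyRDD:
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (for the root RDD only)
        self.parent = parent        # lineage pointer
        self.transform = transform  # function applied to the parent's data

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Recompute from the root by replaying the lineage; this is how
        # a missing or damaged partition is recovered after a failure.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

base = ToyRDD(source=[1, 2, 3, 4, 5])
doubled = base.map(lambda x: x * 2)
big_ones = doubled.filter(lambda x: x > 4)
print(big_ones.collect())  # [6, 8, 10]
```

Note the read-only nature: `map` and `filter` never modify an existing dataset, they only append a new node to the lineage graph, which is what makes recomputation safe.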

3743 questions
317 votes • 17 answers

Spark - repartition() vs coalesce()

According to Learning Spark: "Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the…"
Praveen Sripati • 29,779 • 15 • 74 • 108
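The trade-off named in the excerpt above can be sketched in plain Python (a simplified illustration, not Spark's actual implementation): coalesce() merges whole existing partitions, so no individual record needs to cross the network, while repartition() reassigns every record, which implies a full shuffle.

```python
# Sketch of why coalesce() is cheaper than repartition() when
# decreasing the partition count (illustration only, not Spark's code).

def coalesce(partitions, n):
    # Group whole existing partitions together; each output partition
    # is a concatenation of input partitions, with no per-record movement.
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    # Full shuffle: every single record is reassigned individually.
    out = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            out[hash(record) % n].append(record)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

This is also why coalesce() cannot increase the partition count without a shuffle: merging existing partitions can only reduce their number.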
298 votes • 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
217 votes • 6 answers

What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
Ramana • 6,443 • 7 • 25 • 30
192 votes • 2 answers

Spark performance for Scala vs Python

I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python, for obvious reasons. With that assumption, I thought to learn & write the Scala version of some very…
Mrityunjay • 1,951 • 3 • 12 • 6
182 votes • 5 answers

(Why) do we need to call cache or persist on an RDD?

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the…
Ramana • 6,443 • 7 • 25 • 30
151 votes • 4 answers

Apache Spark: map vs mapPartitions?

What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks. (edit) i.e. what is the difference (either semantically or in terms of execution) between def map[A, B](rdd:…
Nicholas White • 2,452 • 3 • 23 • 27
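The distinction asked about above can be sketched in plain Python (an illustration of the semantics, not Spark's implementation): map applies a function once per element, while mapPartitions applies it once per partition and hands it an iterator, which lets you amortize per-partition setup such as opening a database connection.

```python
# Sketch of map vs mapPartitions semantics (illustration only,
# not Spark's code). An "RDD" here is just a list of partitions.

def rdd_map(partitions, f):
    # f is called once per element.
    return [[f(x) for x in part] for part in partitions]

def rdd_map_partitions(partitions, f):
    # f is called once per partition; it receives an iterator over the
    # partition's elements and must return an iterable of results.
    return [list(f(iter(part))) for part in partitions]

parts = [[1, 2], [3, 4, 5]]

print(rdd_map(parts, lambda x: x * 10))
# [[10, 20], [30, 40, 50]]

def sum_per_partition(it):
    # Any setup/teardown here runs once per partition, not per element.
    yield sum(it)

print(rdd_map_partitions(parts, sum_per_partition))
# [[3], [12]]
```

In this model, flatMap behaves like map (one call per element) but flattens each per-element result into the output.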
148 votes • 12 answers

How to convert rdd object to dataframe in spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
user568109 • 43,824 • 15 • 87 • 118
94 votes • 2 answers

What does "Stage Skipped" mean in Apache Spark web UI?

From my Spark UI. What does it mean when a stage is marked as "skipped"?
Aravind Yarram • 74,434 • 44 • 210 • 298
82 votes • 3 answers

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately, nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example, if my data…
Sohaib • 4,058 • 7 • 35 • 66
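The assumption in the question above is essentially right, and the mechanism is small enough to sketch in plain Python (an illustration; Spark's HashPartitioner works on the key's hashCode with a non-negative modulus, but the idea is the same):

```python
# Minimal sketch of hash partitioning (illustration only, not Spark's
# HashPartitioner). The key property: equal keys always map to the
# same partition, which is what makes key-based operations local.

def partition_for(key, num_partitions):
    # Python's % yields a non-negative result for a positive modulus.
    return hash(key) % num_partitions

def hash_partition(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(pairs, 4)

# Every pair with key "a" lands in the same partition:
idx_a = partition_for("a", 4)
print([kv for kv in parts[idx_a] if kv[0] == "a"])  # [('a', 1), ('a', 3)]
```

Note that nothing guarantees the partitions are balanced: skewed keys produce skewed partitions, since placement depends only on the key's hash.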
73 votes • 4 answers

How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median. This question is similar to this question. However, the…
pr338 • 7,310 • 14 • 45 • 64
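One common approach to the question above is to sort globally and look up the middle element(s) by index rather than collecting everything. Here is a sketch in plain Python over a list of partitions (an illustration of the sort-then-index idea; in Spark this corresponds to something like sortBy + zipWithIndex + lookup, and the per-partition sort plus merge here merely stands in for a distributed sort):

```python
# Sketch of an exact median via global ordering plus index lookup
# (illustration only; the merge step stands in for a distributed sort).
import heapq

def distributed_median(partitions):
    # Sort each partition locally, then merge the sorted runs.
    ordered = list(heapq.merge(*[sorted(p) for p in partitions]))
    n = len(ordered)
    # With a global order and a global index, the median is an index
    # lookup, so only one or two elements ever need to reach the driver.
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

parts = [[7, 1, 5], [3, 9], [2, 8, 4, 6]]
print(distributed_median(parts))  # 5
```

The same index-lookup trick generalizes to arbitrary quantiles: the q-th quantile is simply the element at index round(q * (n - 1)) of the globally sorted data.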
65 votes • 2 answers

How does the DAG work under the covers in RDDs?

The Spark research paper proposed a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially for machine learning. However, the material to uncover the…
sof • 7,253 • 13 • 48 • 74
64 votes • 4 answers

reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way the reduceByKey function works in Spark. Suppose we have the following code: val lines = sc.textFile("data.txt") val pairs = lines.map(s => (s, 1)) val counts = pairs.reduceByKey((a, b) => a +…
user764186 • 881 • 2 • 8 • 12
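The internals asked about above can be sketched in plain Python (an illustration of the two-phase execution, not Spark's code): values are combined locally within each partition first (a map-side combine), and only the much smaller partial results are merged across partitions; this is also why reduceByKey is usually preferred over groupByKey followed by a reduce.

```python
# Sketch of reduceByKey's two-phase execution (illustration only,
# not Spark's code). f must be associative for this to be correct.

def local_combine(partition, f):
    # Phase 1: combine values per key within one partition, before
    # any data moves across the network (map-side combine).
    acc = {}
    for key, value in partition:
        acc[key] = f(acc[key], value) if key in acc else value
    return acc

def reduce_by_key(partitions, f):
    combined = [local_combine(p, f) for p in partitions]
    # Phase 2: merge the per-partition partial results. Only these
    # small dictionaries would be shuffled, not the raw records.
    result = {}
    for partial in combined:
        for key, value in partial.items():
            result[key] = f(result[key], value) if key in result else value
    return result

parts = [[("a", 1), ("b", 1), ("a", 1)],   # partition 0
         [("b", 1), ("a", 1)]]             # partition 1
print(reduce_by_key(parts, lambda x, y: x + y))  # {'a': 3, 'b': 2}
```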
58 votes • 6 answers

Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first one: val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD). onlyNewData contains the rows in todaySchemaRDD that do not…
Interfector • 1,504 • 1 • 19 • 39
55 votes • 2 answers

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util…
Frederico Oliveira • 1,875 • 2 • 12 • 10
54 votes • 2 answers

Which operations preserve RDD order?

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply. Now, which operations preserve that order? E.g., is it guaranteed that (after…
sds • 52,616 • 20 • 134 • 226