Questions tagged [spark-checkpoint]

34 questions
7
votes
1 answer

Spark Structured Streaming fails due to checkpoint file not found

I am running Spark Structured Streaming in a test environment. From time to time the job fails because a checkpoint file is not found. One reason might be that the Kafka topic has a very short retention time. But I've added…
5
votes
1 answer

How to read a checkpoint Dataframe in Spark Scala

I am trying to test the program below, which takes a checkpoint and reads it from the checkpoint location in case the application fails for any reason, such as resource unavailability. When I kill the job and retrigger it, execution restarts from the beginning.…
NRC
  • 63
  • 4
3
votes
2 answers

What is the difference between spark checkpoint and local checkpoint?

What is the difference between Spark checkpoint and local checkpoint? When making a local checkpoint I see this in the Spark UI: it shows that the local checkpoint is saved in memory.
Shadowtrooper
  • 1,067
  • 9
  • 20
3
votes
0 answers

How to figure out Kafka startingOffsets and endingOffsets in a scheduled Spark batch job?

I am trying to read from a Kafka topic in my Spark batch job and publish to another topic. I am not using streaming because it does not fit our use case. According to the Spark docs, the batch job starts reading from the earliest Kafka offsets by…
ak0817
  • 708
  • 1
  • 8
  • 20
3
votes
2 answers

Iterative caching vs checkpointing in Spark

I have an iterative application running on Spark that I simplified to the following code: var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000)) var c: Long = Int.MaxValue var iteration: Int = 0 while (c > 0) { iteration += 1 …
w4bo
  • 727
  • 5
  • 12
2
votes
0 answers

Spark Streaming throws a checkpoint error after more than 10 minutes of running

I am running a streaming job with SQS on EMR. However, after 10 minutes of running it starts throwing an error in the background (the application still runs, though), causing a lot of noise in the logs. 2019-12-09 04:00:00,391 ERROR [JobGenerator]…
Sach
  • 3,115
  • 3
  • 17
  • 35
2
votes
1 answer

Checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an RDD as detailed in the 'Learning Spark' book

In Learning Spark, I read the following: In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this…
Chris Bedford
  • 2,724
  • 2
  • 23
  • 47
2
votes
0 answers

How do I use EMRFS for checkpointing with Structured Streaming?

I have been using S3 for checkpointing with Structured Streaming. However, I am getting a FileNotFound exception related to eventual consistency in S3. Below is what I currently have with S3 checkpointing. val msg =…
vkr
  • 643
  • 13
  • 29
2
votes
0 answers

Spark Checkpointing: Content, Recovery and Idempotency

I am trying to understand the content of a checkpoint and the corresponding recovery. Understanding the process of checkpointing is obviously the natural way of going about it, so I went over the following list: Medium post, SO, Spark docs, the very…
Sheel Pancholi
  • 469
  • 4
  • 19
2
votes
1 answer

Reading from a Hive table and updating the same table in PySpark using checkpoint

I am using Spark version 2.3 and trying to read a Hive table in Spark as: from pyspark.sql import SparkSession from pyspark.sql.functions import * df = spark.table("emp.emptable") Here I am adding a new column with the current date from the system to the…
vikrant rana
  • 3,734
  • 3
  • 23
  • 53
2
votes
1 answer

Spark not able to find checkpointed data in HDFS after executor fails

I am streaming data from Kafka as below: final JavaPairDStream transformedMessages = rtStream .mapToPair(record -> new Tuple2(record.key(), record.value())) …
Amanpreet Khurana
  • 479
  • 1
  • 5
  • 14
1
vote
0 answers

How can I load a checkpointed PySpark DataFrame?

My code below crashed, and instead of restarting from the beginning, I would like to start from the last checkpointed DataFrame. How can I load it? I have this folder in my directory…
Florian
  • 102
  • 9
1
vote
1 answer

How does Spark calculate the window start time for a given window interval?

Suppose I have an input df with a timestamp column. When setting the window duration (with no sliding interval) to 10 minutes, with an input time of (2019-02-28 22:33:02), the window formed is (2019-02-28 22:30:02) to (2019-02-28 22:40:02) 8…
1
vote
1 answer

spark checkpoint : error java.io.FileNotFoundException

I have a pipeline in which I apply several transformations to my DataFrame. It is important to insert checkpoints to ensure an acceptable execution time. However, from time to time I get this error from one of the checkpoints: Job aborted due to…
drlol
  • 153
  • 2
  • 14
1
vote
1 answer

Spark Streaming SQS with checkpoint enabled

I have gone through multiple sites like…
Sach
  • 3,115
  • 3
  • 17
  • 35