Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.


1136 questions
6 votes • 2 answers

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as: ...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ...…
6 votes • 1 answer

GCP Dataproc has Druid available in alpha. How to load segments?

The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some…
radialmind • 239 • 2 • 14
6 votes • 3 answers

How to use Google Cloud Storage for checkpoint location in streaming query?

I'm trying to run a Spark Structured Streaming job and save checkpoints to Google Cloud Storage. I have a couple of jobs: one without aggregation works perfectly, but the second one, with aggregations, throws an exception. I found that someone had similar issues with…
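
For context, a minimal hedged sketch of what a checkpointed streaming aggregation against GCS can look like; the bucket path and the rate source are placeholders, and it assumes the Dataproc image's built-in GCS connector resolves gs:// paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-checkpoint-demo").getOrCreate()

    # Any streaming source works; the rate source keeps the sketch self-contained.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # The aggregation is what forces Spark to write state into the checkpoint dir.
    counts = stream.groupBy("value").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             # Placeholder bucket; the GCS connector handles gs:// URIs on Dataproc.
             .option("checkpointLocation", "gs://my-bucket/checkpoints/agg-demo")
             .start())
    query.awaitTermination()
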
6 votes • 1 answer

org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times

I am using Google Cloud Dataproc to run Spark jobs and my editor is Zeppelin. I was trying to write JSON data into a GCS bucket. It succeeded when I tried a 10MB file, but failed with a 10GB file. My Dataproc cluster has 1 master with 4 CPUs, 26GB memory, 500GB…
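
Not the asker's code, but a hedged sketch of the general shape of such a write, repartitioning first so no single task has to handle an outsized share of the 10GB; the paths and partition count are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-json-to-gcs").getOrCreate()

    df = spark.read.json("gs://my-bucket/input/")   # placeholder input path

    (df.repartition(200)                            # more, smaller tasks and output files
       .write.mode("overwrite")
       .json("gs://my-bucket/output/"))             # placeholder output path
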
6 votes • 0 answers

Facing error when creating Dataproc cluster on Google Cloud

When I try to create the cluster with 1 master and 2 data nodes, I get the error below: Cannot start master: Insufficient number of DataNodes reporting Worker test-sparkjob-w-0 unable to register with master test-sparkjob-m. This could be…
Skumar • 61 • 1
6 votes • 2 answers

How to connect with JMX remotely to Spark worker on Dataproc

I can connect to the driver just fine by adding the following: spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \ -Dcom.sun.management.jmxremote.port=9178 \ …
habitats • 1,693 • 2 • 19 • 30
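
For the JMX question above, a hedged sketch of the executor-side analogue of those driver options; port 0 lets each executor pick a free port so several executors on one worker don't collide, and reaching the workers still requires the usual firewall rules or SSH tunnelling:

    from pyspark.sql import SparkSession

    # Same JVM flags the asker used on the driver, applied to executors instead.
    jmx_opts = ("-Dcom.sun.management.jmxremote "
                "-Dcom.sun.management.jmxremote.port=0 "
                "-Dcom.sun.management.jmxremote.authenticate=false "
                "-Dcom.sun.management.jmxremote.ssl=false")

    spark = (SparkSession.builder
             .appName("executor-jmx-demo")
             .config("spark.executor.extraJavaOptions", jmx_opts)
             .getOrCreate())
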
6 votes • 6 answers

Automatically shutdown Google Dataproc cluster after all jobs are completed

How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have completed? Dataproc provides creation, monitoring and management, but I can't seem to find out how to delete the cluster.
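
One hedged approach (not from the question) is to have whatever orchestrates the jobs delete the cluster as its final step; a sketch using the google-cloud-dataproc client library, with project, region and cluster names as placeholders. On newer Dataproc releases, scheduled deletion (such as an idle timeout set at cluster-creation time) is another option.

    from google.cloud import dataproc_v1

    PROJECT, REGION, CLUSTER = "my-project", "us-central1", "my-cluster"  # placeholders

    # A regional endpoint is required for non-global regions.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"})

    operation = client.delete_cluster(request={
        "project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER})
    operation.result()  # block until the deletion finishes
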
6 votes • 1 answer

How can I run two parallel jobs on Google Dataproc

I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs. From what I can gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it…
fbexiga • 75 • 5
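
For the parallel-jobs question above, a hedged sketch of one way to keep the long job from claiming every YARN container, leaving room for a second job; the cap is a placeholder value:

    from pyspark.sql import SparkSession

    # Cap how many executors the long-running job may hold, so the YARN queue
    # still has containers available for a second, smaller job.
    spark = (SparkSession.builder
             .appName("long-running-job")
             .config("spark.dynamicAllocation.maxExecutors", "4")   # placeholder cap
             .getOrCreate())
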
6 votes • 1 answer

How to import csv files with massive column count into Apache Spark 2.0

I'm running into a problem importing multiple small CSV files with over 250,000 float64 columns into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in one as the class…
mobcdi • 1,312 • 1 • 21 • 42
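
For the wide-CSV question above, a hedged sketch of reading such files without per-column schema inference (the expensive part with hundreds of thousands of columns); the path and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wide-csv").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "false")          # keep everything as strings
          .csv("gs://my-bucket/wide-csvs/*.csv"))  # placeholder path

    # Cast only the columns that are actually needed rather than all 250,000.
    feature_cols = df.columns[1:11]                # placeholder selection
    subset = df.select("class", *(df[c].cast("double") for c in feature_cols))
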
6 votes • 1 answer

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console. I noticed that the two workers…
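
Worth noting (hedged, not from the question): Dataproc writes the Spark executor defaults into spark-defaults.conf when the cluster is created, so after changing machine types they are commonly overridden per job; the values below are placeholders sized for n1-highmem-8 workers, and YARN's own NodeManager memory settings on the nodes may also need updating.

    from pyspark.sql import SparkSession

    # Override the stale cluster-creation defaults for this job only; placeholder values.
    spark = (SparkSession.builder
             .appName("resized-cluster-job")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "20g")
             .getOrCreate())
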
6 votes • 1 answer

PySpark print to console

When running a PySpark job on the Dataproc server like this: gcloud --project … dataproc jobs submit pyspark --cluster … my print statements don't show up in my terminal. Is there any way to output data…
Roman • 6,398 • 6 • 50 • 87
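
For the print question above, a hedged note and sketch: driver-side output normally lands in the job's driver output (visible on the job's Cloud Console page or via gcloud dataproc jobs wait), while print() calls inside map/foreach run on executors and end up in the executor logs instead; routing through logging makes the destination explicit. The logger name and message are placeholders:

    import logging

    # Goes to stderr on the driver, which Dataproc captures as job driver output.
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("my_job")            # placeholder logger name

    log.info("rows processed: %d", 12345)        # placeholder message
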
6 votes • 1 answer

Spark loses all executors one minute after starting

I run PySpark on an 8-node Google Dataproc cluster with default settings. A few seconds after starting I see 30 executor cores running (as expected), with sc.defaultParallelism at 30. One minute later sc.defaultParallelism is 2. From that…
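
Hedged context for the question above: Dataproc enables Spark dynamic allocation by default, so executors that sit idle (roughly a minute by default) are released; pinning a fixed executor count is one way to keep them, with the instance count below a placeholder:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fixed-executors")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "8")   # placeholder count
             .getOrCreate())
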
6 votes • 2 answers

Use an external library in a PySpark job in a Spark cluster from google-dataproc

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster,…
sweeeeeet • 1,599 • 2 • 19 • 45
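
For the spark-csv question above, a hedged sketch of pulling the package at session start via spark.jars.packages (the same property can also be set at job-submission time); the Maven coordinate is illustrative, and on Spark 2.x CSV support is built in so the package is unnecessary:

    from pyspark.sql import SparkSession

    # Illustrative coordinate; the _2.xx suffix must match the cluster's Scala build.
    spark = (SparkSession.builder
             .appName("spark-csv-demo")
             .config("spark.jars.packages", "com.databricks:spark-csv_2.11:1.5.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("gs://my-bucket/data.csv"))       # placeholder path
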
6 votes • 1 answer

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to manually install the libraries I need. It would be great to…
5 votes • 1 answer

Dataproc cluster fails to initialize

With the standard dataproc image 1.5 (Debian 10, Hadoop 2.10, Spark 2.4), a dataproc cluster cannot be created. Region is set to europe-west-2. The stack-driver log says: "Failed to initialize node -m: Component hdfs failed to…
tak • 85 • 5