Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.


1136 questions
6 votes • 2 answers

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as: ...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ...…
6 votes • 1 answer

GCP Dataproc has Druid available in alpha. How to load segments?

The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some…
radialmind • 239 • 2 • 14
6 votes • 3 answers

How to use Google Cloud Storage for checkpoint location in streaming query?

I'm trying to run a Spark Structured Streaming job and save checkpoints to Google Cloud Storage. I have a couple of jobs: one without aggregation works perfectly, but the second one, with aggregations, throws an exception. I found that someone had similar issues with…
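
For context, a minimal hedged sketch of what a checkpointed streaming aggregation against GCS can look like; the bucket path and the rate source are placeholders, and it assumes the Dataproc image's built-in GCS connector resolves gs:// paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-checkpoint-demo").getOrCreate()

    # Any streaming source works; the rate source keeps the sketch self-contained.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # The aggregation is what forces Spark to write state into the checkpoint dir.
    counts = stream.groupBy("value").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             # Placeholder bucket; the GCS connector handles gs:// URIs on Dataproc.
             .option("checkpointLocation", "gs://my-bucket/checkpoints/agg-demo")
             .start())
    query.awaitTermination()
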
6 votes • 1 answer

org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times

I am using Google Cloud Dataproc to run Spark jobs and my editor is Zeppelin. I was trying to write JSON data into a GCS bucket. It succeeded when I tried a 10MB file, but failed with a 10GB file. My Dataproc cluster has 1 master with 4 CPUs, 26GB memory, 500GB…
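
Not the asker's code, but a hedged sketch of the general shape of such a write, repartitioning first so no single task has to handle an outsized share of the 10GB; the paths and partition count are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-json-to-gcs").getOrCreate()

    df = spark.read.json("gs://my-bucket/input/")   # placeholder input path

    (df.repartition(200)                            # more, smaller tasks and output files
       .write.mode("overwrite")
       .json("gs://my-bucket/output/"))             # placeholder output path
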
6 votes • 0 answers

Facing error when creating Dataproc cluster on Google Cloud

When I try to create the cluster with 1 master and 2 data nodes, I get the error below: Cannot start master: Insufficient number of DataNodes reporting Worker test-sparkjob-w-0 unable to register with master test-sparkjob-m. This could be…
Skumar • 61 • 1
6 votes • 2 answers

How to connect with JMX remotely to Spark worker on Dataproc

I can connect to the driver just fine by adding the following: spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \ -Dcom.sun.management.jmxremote.port=9178 \ …
habitats • 1,693 • 2 • 19 • 30
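
For the JMX question above, a hedged sketch of the executor-side analogue of those driver options; port 0 lets each executor pick a free port so several executors on one worker don't collide, and reaching the workers still requires the usual firewall rules or SSH tunnelling:

    from pyspark.sql import SparkSession

    # Same JVM flags the asker used on the driver, applied to executors instead.
    jmx_opts = ("-Dcom.sun.management.jmxremote "
                "-Dcom.sun.management.jmxremote.port=0 "
                "-Dcom.sun.management.jmxremote.authenticate=false "
                "-Dcom.sun.management.jmxremote.ssl=false")

    spark = (SparkSession.builder
             .appName("executor-jmx-demo")
             .config("spark.executor.extraJavaOptions", jmx_opts)
             .getOrCreate())
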
6 votes • 6 answers

Automatically shutdown Google Dataproc cluster after all jobs are completed

How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have completed? Dataproc provides creation, monitoring and management, but I can't seem to find out how to delete the cluster.
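
One hedged approach (not from the question) is to have whatever orchestrates the jobs delete the cluster as its final step; a sketch using the google-cloud-dataproc client library, with project, region and cluster names as placeholders. On newer Dataproc releases, scheduled deletion (such as an idle timeout set at cluster-creation time) is another option.

    from google.cloud import dataproc_v1

    PROJECT, REGION, CLUSTER = "my-project", "us-central1", "my-cluster"  # placeholders

    # A regional endpoint is required for non-global regions.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"})

    operation = client.delete_cluster(request={
        "project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER})
    operation.result()  # block until the deletion finishes
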
6 votes • 1 answer

How can I run two parallel jobs on Google Dataproc

I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs. From what I can gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it…
fbexiga • 75 • 5
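
For the parallel-jobs question above, a hedged sketch of one way to keep the long job from claiming every YARN container, leaving room for a second job; the cap is a placeholder value:

    from pyspark.sql import SparkSession

    # Cap how many executors the long-running job may hold, so the YARN queue
    # still has containers available for a second, smaller job.
    spark = (SparkSession.builder
             .appName("long-running-job")
             .config("spark.dynamicAllocation.maxExecutors", "4")   # placeholder cap
             .getOrCreate())
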
6 votes • 1 answer

How to import csv files with massive column count into Apache Spark 2.0

I'm running into a problem importing multiple small CSV files with over 250,000 float64 columns into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in one as the class…
mobcdi • 1,312 • 1 • 21 • 42
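
For the wide-CSV question above, a hedged sketch of reading such files without per-column schema inference (the expensive part with hundreds of thousands of columns); the path and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wide-csv").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "false")          # keep everything as strings
          .csv("gs://my-bucket/wide-csvs/*.csv"))  # placeholder path

    # Cast only the columns that are actually needed rather than all 250,000.
    feature_cols = df.columns[1:11]                # placeholder selection
    subset = df.select("class", *(df[c].cast("double") for c in feature_cols))
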
6 votes • 1 answer

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console. I noticed that the two workers…
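
Worth noting (hedged, not from the question): Dataproc writes the Spark executor defaults into spark-defaults.conf when the cluster is created, so after changing machine types they are commonly overridden per job; the values below are placeholders sized for n1-highmem-8 workers, and YARN's own NodeManager memory settings on the nodes may also need updating.

    from pyspark.sql import SparkSession

    # Override the stale cluster-creation defaults for this job only; placeholder values.
    spark = (SparkSession.builder
             .appName("resized-cluster-job")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "20g")
             .getOrCreate())
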
6 votes • 1 answer

PySpark print to console

When running a PySpark job on the Dataproc server like this: gcloud --project … dataproc jobs submit pyspark --cluster … my print statements don't show up in my terminal. Is there any way to output data…
Roman • 6,398 • 6 • 50 • 87
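
For the print question above, a hedged note and sketch: driver-side output normally lands in the job's driver output (visible on the job's Cloud Console page or via gcloud dataproc jobs wait), while print() calls inside map/foreach run on executors and end up in the executor logs instead; routing through logging makes the destination explicit. The logger name and message are placeholders:

    import logging

    # Goes to stderr on the driver, which Dataproc captures as job driver output.
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("my_job")            # placeholder logger name

    log.info("rows processed: %d", 12345)        # placeholder message
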
6 votes • 1 answer

Spark loses all executors one minute after starting

I run PySpark on an 8-node Google Dataproc cluster with default settings. A few seconds after starting I see 30 executor cores running (as expected), with sc.defaultParallelism at 30. One minute later sc.defaultParallelism is 2. From that…
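
Hedged context for the question above: Dataproc enables Spark dynamic allocation by default, so executors that sit idle (roughly a minute by default) are released; pinning a fixed executor count is one way to keep them, with the instance count below a placeholder:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fixed-executors")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "8")   # placeholder count
             .getOrCreate())
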
6 votes • 2 answers

Use an external library in a PySpark job in a Spark cluster from google-dataproc

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster,…
sweeeeeet • 1,599 • 2 • 19 • 45
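
For the spark-csv question above, a hedged sketch of pulling the package at session start via spark.jars.packages (the same property can also be set at job-submission time); the Maven coordinate is illustrative, and on Spark 2.x CSV support is built in so the package is unnecessary:

    from pyspark.sql import SparkSession

    # Illustrative coordinate; the _2.xx suffix must match the cluster's Scala build.
    spark = (SparkSession.builder
             .appName("spark-csv-demo")
             .config("spark.jars.packages", "com.databricks:spark-csv_2.11:1.5.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("gs://my-bucket/data.csv"))       # placeholder path
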
6 votes • 1 answer

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to manually install the libraries I need. It would be great to…
5 votes • 1 answer

Dataproc cluster fails to initialize

With the standard dataproc image 1.5 (Debian 10, Hadoop 2.10, Spark 2.4), a dataproc cluster cannot be created. Region is set to europe-west-2. The stack-driver log says: "Failed to initialize node -m: Component hdfs failed to…
tak • 85 • 5