Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1136 questions
58 votes, 5 answers

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Cloud Dataflow to implement an ETL data warehouse solution. Looking into the Google Cloud offerings, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit cheaper than Dataflow. Does anybody know the pros /…
48 votes, 6 answers

Google Cloud Platform: how to monitor memory usage of VM instances

I have recently performed a migration to Google Cloud Platform, and I really like it. However, I can't find a way to monitor the memory usage of the Dataproc VM instances. As you can see in the attachment, the console provides utilization info about…
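
The usual resolution: Dataproc VMs do not report memory to the console by default, because memory is collected by the monitoring agent rather than by the hypervisor. Once the agent is installed on the cluster VMs, agent metrics can be queried. A minimal sketch using the Cloud Monitoring Python client (google-cloud-monitoring 2.x assumed); the project ID is a placeholder:

```python
# Hedged sketch: query agent-reported memory usage for a project's VMs.
# Assumes the monitoring agent is installed on the Dataproc VMs and that
# "my-project" is replaced with your project ID.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval({
    "end_time": {"seconds": int(now)},
    "start_time": {"seconds": int(now - 3600)},  # the last hour
})
series = client.list_time_series(
    request={
        "name": "projects/my-project",
        # the agent's memory metric; not populated without the agent
        "filter": 'metric.type = "agent.googleapis.com/memory/percent_used"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    print(ts.resource.labels["instance_id"], ts.points[0].value.double_value)
```
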
16 votes, 2 answers

Output from Dataproc Spark job in Google Cloud Logging

Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two…
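
One approach that comes up in answers is to log through YARN, since Dataproc forwards YARN container logs to Cloud Logging. A minimal sketch, assuming a PySpark job; the cluster property named in the comment applies to newer image versions and is an assumption here:

```python
# Hedged sketch: route job messages through Spark's log4j logger so they land
# in the YARN container logs, which Dataproc forwards to Cloud Logging.
# Driver output itself can reportedly also be forwarded by creating the cluster
# with dataproc:dataproc.logging.stackdriver.job.driver.enable=true (newer images).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logging-demo").getOrCreate()
log4j = spark.sparkContext._jvm.org.apache.log4j  # py4j handle into the JVM
logger = log4j.LogManager.getLogger("my.dataproc.job")  # logger name is arbitrary
logger.info("this message goes to the YARN container log")
```
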
14 votes, 3 answers

Where is the Spark UI on Google Dataproc?

What port should I use to access the Spark UI on Google Dataproc? I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln. The firewall is properly configured.
asked by BAR
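
Because Dataproc runs Spark on YARN, the standalone ports (4040 on the client, 7077 for the master) are not the right ones: the application UI is reached through the YARN ResourceManager on the master node, typically port 8088 (with the Spark history server commonly on 18080), usually over an SSH tunnel rather than an opened firewall. A hedged sketch using a gcloud SOCKS proxy; the cluster name and zone are placeholders:

```python
# Hedged sketch: open a SOCKS proxy to the master node, then browse
# http://mycluster-m:8088 (YARN ResourceManager) through a proxy-configured
# browser to click through to the Spark UI. Blocks until interrupted.
import subprocess

subprocess.run(
    [
        "gcloud", "compute", "ssh", "mycluster-m",  # "<cluster>-m" is the master
        "--zone=us-central1-a",
        "--",          # everything after this is passed to ssh itself
        "-D", "1080",  # local SOCKS proxy port
        "-N",          # no remote command, just the tunnel
    ],
    check=True,
)
```
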
12 votes, 2 answers

Which HBase connector for Spark 2.0 should I use?

Our stack is composed of Google Dataproc (Spark 2.0) and Google Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. Spark 2.0 and new Dataset API support are not clear to me for the connectors I have…
12 votes, 3 answers

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

For example, I have a folder: / - test.py - test.yml, and the job is submitted to the Spark cluster with: gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py". In test.py, I want to access the static file I uploaded. With…
asked by lucemia
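
Files shipped with --files are staged into the YARN container's working directory, so a relative path usually works; SparkFiles is the explicit alternative for files added through the API. A minimal sketch (PyYAML is assumed to be available on the cluster):

```python
# Hedged sketch: two common ways to read a file shipped alongside a PySpark job.
import yaml
from pyspark import SparkContext, SparkFiles

sc = SparkContext()

# 1) Files passed via --files are staged in the container's working
#    directory, so a relative path is often enough:
with open("test.yml") as f:
    config = yaml.safe_load(f)

# 2) For files registered with addFile, resolve the staged path explicitly:
sc.addFile("test.yml")
with open(SparkFiles.get("test.yml")) as f:
    config = yaml.safe_load(f)
```
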
11 votes, 2 answers

Error: permission denied on resource project when launching Dataproc cluster

I was able to successfully launch a Dataproc cluster by manually creating one via gcloud dataproc clusters create.... However, when I try to launch one through a script (that automatically provisions and stops clusters), it says ERROR:…
asked by claudiadast
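
Scripted launches typically run as a service account rather than your own user account, and that account needs a Dataproc role on the project. A hedged sketch of the grant; the project and service-account names are placeholders:

```python
# Hedged sketch: give the script's service account permission to manage
# Dataproc resources in the project. Names below are placeholders.
import subprocess

subprocess.run(
    [
        "gcloud", "projects", "add-iam-policy-binding", "my-project",
        "--member=serviceAccount:provisioner@my-project.iam.gserviceaccount.com",
        "--role=roles/dataproc.editor",
    ],
    check=True,
)
```
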
11 votes, 1 answer

Unable to connect to a Google Storage file using the GCS connector from Spark

I have written a Spark job on my local machine which reads a file from Google Cloud Storage using the Google Hadoop connector (paths like gs://storage.googleapis.com/), as mentioned in https://cloud.google.com/dataproc/docs/connectors/cloud-storage. I have set…
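
When the job runs outside GCP, the connector also needs explicit credentials. A minimal sketch pointing it at a service-account key; the key path is a placeholder, and the json.keyfile property assumes a reasonably recent gcs-connector:

```python
# Hedged sketch: configure the GCS connector with a service-account key for a
# Spark job running outside Google Cloud. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")

df = spark.read.text("gs://mybucket/folder/")  # read via the gs:// scheme
df.show(5)
```
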
11 votes, 4 answers

spark.sql.crossJoin.enabled for Spark 2.x

I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a cartesian product. Since version 2.0.0 a Spark configuration parameter has been added (spark.sql.crossJoin.enabled)…
asked by Stijn
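
A minimal sketch of the fix on Spark 2.0: set spark.sql.crossJoin.enabled when building the session, after which an unconditioned join is accepted (Spark 2.1+ also adds an explicit crossJoin method on DataFrames):

```python
# Hedged sketch: allow cartesian products in Spark 2.0 via the new flag.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cartesian-demo")
    .config("spark.sql.crossJoin.enabled", "true")
    .getOrCreate()
)

left = spark.createDataFrame([(1,), (2,)], ["a"])
right = spark.createDataFrame([("x",), ("y",)], ["b"])

# With the flag on, a join without a condition is a cartesian product.
product = left.join(right)
product.show()
```
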
11 votes, 1 answer

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input / Output…
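
The pattern from the Dataproc documentation reads the table through the BigQuery connector's Hadoop InputFormat via newAPIHadoopRDD. A sketch with placeholder project and bucket values, reading a public sample table:

```python
# Hedged sketch: load a BigQuery table as an RDD through the BigQuery
# connector's Hadoop InputFormat. Project and bucket names are placeholders.
import json
from pyspark import SparkContext

sc = SparkContext()
conf = {
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-staging-bucket",  # scratch space for the export
    "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq-tmp",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}
table = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)
# each record arrives as (row number, JSON string); parse into dicts for ETL
rows = table.map(lambda kv: json.loads(kv[1]))
print(rows.take(3))
```
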
10 votes, 3 answers

Guava version while using spark-shell

I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc; however, I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath is including a much older Guava version from somewhere else,…
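
A workaround often suggested for such conflicts is to prefer user-supplied jars over the cluster's classpath. A hedged sketch; the connector coordinate is only an example, and both userClassPathFirst flags are experimental in Spark:

```python
# Hedged sketch: launch spark-shell preferring user-supplied jars over the
# cluster's (older) Guava. The package coordinate below is an example version.
import subprocess

subprocess.run(
    [
        "spark-shell",
        "--packages", "datastax:spark-cassandra-connector:2.0.1-s_2.11",
        "--conf", "spark.driver.userClassPathFirst=true",
        "--conf", "spark.executor.userClassPathFirst=true",
    ],
    check=True,
)
```
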
10 votes, 1 answer

Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to YARN being misconfigured. I receive the following error when running "spark-shell" from the shell (locally on the master), as well as when uploading a job through…
asked by habitats
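
If the values Dataproc computed are wrong for a use case, they can be overridden at cluster creation time with --properties (the yarn: prefix targets yarn-site.xml, spark: targets spark-defaults.conf). A sketch with placeholder names and sizes:

```python
# Hedged sketch: create a cluster with explicit YARN/Spark memory settings.
# Cluster name, region, and the memory figures are placeholders.
import subprocess

subprocess.run(
    [
        "gcloud", "dataproc", "clusters", "create", "my-cluster",
        "--region=us-central1",
        "--properties=" + ",".join([
            "yarn:yarn.nodemanager.resource.memory-mb=12288",
            "yarn:yarn.scheduler.maximum-allocation-mb=12288",
            "spark:spark.executor.memory=4g",
        ]),
    ],
    check=True,
)
```
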
10 votes, 2 answers

Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "native and automatic integrations with BigQuery". I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job). Then…
10 votes, 3 answers

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I get the following error: 5932 [main] ERROR…
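
The gs:// scheme is preconfigured on Dataproc but not in a local Spark install: the GCS connector jar must be on the classpath and the filesystem implementation registered. A sketch against a modern PySpark (the question itself used 1.2.0); the jar path is a placeholder:

```python
# Hedged sketch for local runs: ship the GCS connector jar and register the
# gs:// filesystem classes so Hadoop can resolve the scheme.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-gcs")
    .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
    .getOrCreate()
)
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl",
          "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

print(spark.read.text("gs://mybucket/folder").count())
```
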
8 votes, 1 answer

How to install python packages in a Google Dataproc cluster

Is it possible to install python packages in a Google Dataproc cluster after the cluster is created and running? I tried to use "pip install xxxxxxx" in the master command line but it does not seem to work. Google's Dataproc documentation does not…
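
The commonly recommended route is an initialization action that runs on every node at cluster creation (for an already-running cluster, the install has to be repeated on each node, not just the master). A hedged sketch; the bucket path and package list are placeholders:

```python
# Hedged sketch: install Python packages cluster-wide via an initialization
# action. The script at the placeholder gs:// path would contain, e.g.:
#   #!/bin/bash
#   pip install numpy pandas
import subprocess

subprocess.run(
    [
        "gcloud", "dataproc", "clusters", "create", "my-cluster",
        "--region=us-central1",
        "--initialization-actions=gs://my-bucket/install-python-deps.sh",
    ],
    check=True,
)
```
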