Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1136 questions
8 votes, 2 answers

Why does Spark (on Google Dataproc) not use all vcores?

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use…
borarak • 907 • 1 • 10 • 22
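
A frequent cause here is that the executor shape is left at the image defaults, so YARN schedules fewer or smaller containers than the cluster could hold (and the YARN UI, which uses a memory-only resource calculator by default, may additionally report just one vcore per container). A minimal PySpark sketch of requesting an explicit executor shape; the instance, core and memory values below are illustrative assumptions, not figures from the question:

```python
from pyspark.sql import SparkSession

# Illustrative sizing: pick values so that
# executors-per-node x spark.executor.cores covers the vcores on each worker.
spark = (
    SparkSession.builder
    .appName("use-all-vcores")
    .config("spark.executor.instances", "4")  # assumed cluster-wide executor count
    .config("spark.executor.cores", "4")      # assumed vcores per executor
    .config("spark.executor.memory", "5g")    # must still fit in the YARN container
    .getOrCreate()
)

# Rough sanity check: default parallelism should reflect the cores requested.
print(spark.sparkContext.defaultParallelism)
```
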
8 votes, 2 answers

How to read a simple text file from Google Cloud Storage using a Spark-Scala local program

As given in the blog below, https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview, I was trying to read a file from Google Cloud Storage using Spark-Scala. For that I have imported…
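
For a gs:// path to resolve, the Cloud Storage connector must be on the Spark classpath, and a program running outside Dataproc also needs credentials. A minimal sketch in PySpark (the Scala DataFrame API is analogous); the bucket, object and key-file paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-text")
    # Only needed when running locally; a Dataproc cluster already has the
    # connector installed and authenticates via its service account.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")  # placeholder key file
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some/file.txt")  # placeholder object
df.show(5, truncate=False)
```
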
8 votes, 1 answer

Pausing Dataproc cluster - Google Compute Engine

Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link:…
femibyte • 2,585 • 4 • 27 • 51
8 votes, 1 answer

Google Cloud Dataproc configuration issues

I've been encountering various issues (mainly disassociation errors at seemingly random intervals) in some Spark LDA topic modeling I've been running, which I think mainly have to do with insufficient memory allocation on my executors. This would…
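
Executor loss ("disassociated" errors) during iterative jobs like LDA usually means YARN killed the container for exceeding its memory. A hedged sketch of raising the executor memory and overhead; all sizes below are illustrative assumptions that must still fit the machine types in use:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lda-memory-tuning")
    .config("spark.executor.memory", "10g")          # assumed heap size
    .config("spark.executor.memoryOverhead", "2g")   # Spark 2.3+ name; older releases use spark.yarn.executor.memoryOverhead
    .config("spark.driver.memory", "8g")             # assumed driver size
    .getOrCreate()
)
```
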
8 votes, 1 answer

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the…
bjorndv • 413 • 4 • 13
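
A sketch of how the pieces are usually wired together when shipping dependencies with --py-files; the bucket, archive and module names are placeholders, and the submit command in the comment is only an example invocation:

```python
# main.py -- the file named as the job's main Python file.
# Example submit (placeholders throughout):
#   gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
#       --cluster=my-cluster \
#       --py-files=gs://my-bucket/deps.zip
# Spark puts deps.zip on the Python path of the driver and executors, so the
# packages inside it can be imported by name (no relative path needed).
from pyspark.sql import SparkSession

from mypackage import helpers  # placeholder package that lives inside deps.zip

spark = SparkSession.builder.appName("pyfiles-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(helpers.transform).collect())  # placeholder function
```
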
7 votes, 1 answer

GCP Dataproc custom image Python environment

I have an issue when I create a Dataproc custom image and use PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialization script I install Python 3 and some packages from a requirements.txt file, then set the python3 env…
7 votes, 2 answers

ModuleNotFoundError because PySpark serializer is not able to locate library folder

I have the following folder structure:
- libfolder
  - lib1.py
  - lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc…
Golak Sarangi • 664 • 6 • 20
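
The usual fix is to ship libfolder to the workers as a zip whose top level is the package itself and register it with the SparkContext. A minimal sketch, assuming libfolder contains an __init__.py and the archive was built with something like zip -r libs.zip libfolder; the GCS path and function names are placeholders:

```python
# main.py -- placeholder paths and names throughout.
from pyspark import SparkContext

sc = SparkContext(appName="package-on-workers")

# Makes the archive importable on the driver and every executor.
sc.addPyFile("gs://my-bucket/libs.zip")  # placeholder location

from libfolder import lib1  # import after addPyFile so it can be resolved

print(sc.parallelize([1, 2, 3]).map(lib1.process).collect())  # placeholder function
```
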
7 votes, 3 answers

Error while running PySpark Dataproc job due to Python version

I create a Dataproc cluster using the following command:
gcloud dataproc clusters create datascience \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh \
However, when I submit my PySpark job I get the following…
Kassem Shehady • 1,117 • 10 • 22
7 votes, 1 answer

YARN applications cannot start when specifying YARN node labels

I'm trying to use YARN node labels to tag worker nodes, but when I run applications on YARN (Spark or a simple YARN app), those applications cannot start. With Spark, when specifying --conf spark.yarn.am.nodeLabelExpression="my-label", the job cannot…
norbjd • 6,496 • 3 • 25 • 56
7 votes, 4 answers

How to run Python 3 on Google's Dataproc PySpark

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of 2.7 by default. The best I've been able to find is adding these initialization commands. However, when I ssh…
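
The driver and executor interpreters must match, and on Dataproc they are normally pinned cluster-wide (for example by an initialization action, or via the PYSPARK_PYTHON / spark.pyspark.python settings) rather than from inside the job. A small diagnostic sketch to confirm which interpreter each side actually runs:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-python").getOrCreate()

# Interpreter running the driver process.
print("driver:", sys.version)

# Interpreter running the executors (collected from one task per partition).
executor_versions = (
    spark.sparkContext.parallelize(range(2), 2)
    .map(lambda _: sys.version)
    .distinct()
    .collect()
)
print("executors:", executor_versions)
```
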
7 votes, 2 answers

Request insufficient authentication scopes when running Spark job on Dataproc

I am trying to run the Spark job on the Google Dataproc cluster as
gcloud dataproc jobs submit hadoop --cluster \
    --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    --class org.apache.hadoop.examples.WordCount…
Vishal • 1,177 • 2 • 22 • 40
7 votes, 1 answer

How to get path to the uploaded file

I am running a Spark cluster on Google Cloud, and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the…
orestis • 712 • 7 • 20
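
Files shipped with a job (for instance through a --files flag at submit time, or sc.addFile in code) are copied into each node's working directory, and SparkFiles resolves the local path. A minimal sketch; only the file name Configuration.properties comes from the question, the rest is illustrative:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-shipped-config").getOrCreate()
sc = spark.sparkContext

# If the file was not passed at submit time, it can also be added here
# (placeholder GCS path).
# sc.addFile("gs://my-bucket/Configuration.properties")

# Local path on the driver (and on executors, inside tasks).
path = SparkFiles.get("Configuration.properties")
with open(path) as f:
    config_text = f.read()
print(config_text)
```
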
7 votes, 3 answers

Read from BigQuery into Spark in an efficient way?

When using the BigQuery connector to read data from BigQuery, I found that it first copies all data to Google Cloud Storage and then reads it in parallel into Spark, but when reading a big table the data-copying stage takes a very long time. So is…
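
The staging-through-GCS behaviour described above is how the older Hadoop-based BigQuery connector works; the spark-bigquery connector instead reads through the BigQuery Storage API, which avoids the copy stage. A hedged sketch, assuming that connector's jar is available on the cluster (for example added with --jars); the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-direct-read").getOrCreate()

# Reads directly from BigQuery storage, in parallel, without an export to GCS.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")  # placeholder table
    .load()
)
df.printSchema()
print(df.count())
```
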
7 votes, 1 answer

Connecting IPython notebook to spark master running in different machines

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem. I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the image jupyter/all-spark-notebook. I have…
6 votes, 1 answer

How does Spark (2.3 or newer) determine the number of tasks to read Hive table files in a GCS bucket or HDFS?

Input data: a Hive table (T) with 35 files (~1.5 GB each, SequenceFile); the files are in a GCS bucket; default fs.gs.block.size = ~128 MB; all other parameters are default. Experiment 1: create a Dataproc cluster with 2 workers (4 cores per worker) and run select…
dykw • 1,089 • 3 • 12 • 16
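
The number of scan tasks comes from the input splits computed by the table's Hadoop InputFormat, which depend mainly on the file sizes and the block size the filesystem reports (fs.gs.block.size for GCS), not on the worker core count. A quick diagnostic sketch for seeing the resulting partition count; only the table name T is taken from the question:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("inspect-hive-splits")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("T")  # Hive table from the question

# Number of tasks in the scan stage corresponds to the input splits computed
# from the files' sizes and the reported block size.
print("partitions:", df.rdd.getNumPartitions())
```
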