Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1136 questions
8 votes, 2 answers

Why does Spark (on Google Dataproc) not use all vcores?

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use…
borarak • 907 • 1 • 10 • 22
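
A frequent cause here is that the executor shape is left at the image defaults, so YARN schedules fewer or smaller containers than the cluster could hold (and the YARN UI, which uses a memory-only resource calculator by default, may additionally report just one vcore per container). A minimal PySpark sketch of requesting an explicit executor shape; the instance, core and memory values below are illustrative assumptions, not figures from the question:

```python
from pyspark.sql import SparkSession

# Illustrative sizing: pick values so that
# executors-per-node x spark.executor.cores covers the vcores on each worker.
spark = (
    SparkSession.builder
    .appName("use-all-vcores")
    .config("spark.executor.instances", "4")  # assumed cluster-wide executor count
    .config("spark.executor.cores", "4")      # assumed vcores per executor
    .config("spark.executor.memory", "5g")    # must still fit in the YARN container
    .getOrCreate()
)

# Rough sanity check: default parallelism should reflect the cores requested.
print(spark.sparkContext.defaultParallelism)
```
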
8 votes, 2 answers

How to read a simple text file from Google Cloud Storage using a Spark-Scala local program

As given in the blog below, https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview, I was trying to read a file from Google Cloud Storage using Spark-Scala. For that I have imported…
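
For a gs:// path to resolve, the Cloud Storage connector must be on the Spark classpath, and a program running outside Dataproc also needs credentials. A minimal sketch in PySpark (the Scala DataFrame API is analogous); the bucket, object and key-file paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-text")
    # Only needed when running locally; a Dataproc cluster already has the
    # connector installed and authenticates via its service account.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")  # placeholder key file
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some/file.txt")  # placeholder object
df.show(5, truncate=False)
```
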
8 votes, 1 answer

Pausing Dataproc cluster - Google Compute Engine

Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link:…
femibyte • 2,585 • 4 • 27 • 51
8 votes, 1 answer

Google Cloud Dataproc configuration issues

I've been encountering various issues (mainly disassociation errors at seemingly random intervals) in some Spark LDA topic modeling I've been running, which I think mainly have to do with insufficient memory allocation on my executors. This would…
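
Executor loss ("disassociated" errors) during iterative jobs like LDA usually means YARN killed the container for exceeding its memory. A hedged sketch of raising the executor memory and overhead; all sizes below are illustrative assumptions that must still fit the machine types in use:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lda-memory-tuning")
    .config("spark.executor.memory", "10g")          # assumed heap size
    .config("spark.executor.memoryOverhead", "2g")   # Spark 2.3+ name; older releases use spark.yarn.executor.memoryOverhead
    .config("spark.driver.memory", "8g")             # assumed driver size
    .getOrCreate()
)
```
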
8 votes, 1 answer

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the…
bjorndv • 413 • 4 • 13
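
A sketch of how the pieces are usually wired together when shipping dependencies with --py-files; the bucket, archive and module names are placeholders, and the submit command in the comment is only an example invocation:

```python
# main.py -- the file named as the job's main Python file.
# Example submit (placeholders throughout):
#   gcloud dataproc jobs submit pyspark gs://my-bucket/main.py \
#       --cluster=my-cluster \
#       --py-files=gs://my-bucket/deps.zip
# Spark puts deps.zip on the Python path of the driver and executors, so the
# packages inside it can be imported by name (no relative path needed).
from pyspark.sql import SparkSession

from mypackage import helpers  # placeholder package that lives inside deps.zip

spark = SparkSession.builder.appName("pyfiles-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(helpers.transform).collect())  # placeholder function
```
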
7 votes, 1 answer

GCP Dataproc custom image Python environment

I have an issue when I create a Dataproc custom image and use PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialization script I install Python 3 and some packages from a requirements.txt file, then set the python3 env…
7 votes, 2 answers

ModuleNotFoundError because PySpark serializer is not able to locate library folder

I have the following folder structure:
- libfolder
  - lib1.py
  - lib2.py
- main.py
main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc…
Golak Sarangi • 664 • 6 • 20
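
The usual fix is to ship libfolder to the workers as a zip whose top level is the package itself and register it with the SparkContext. A minimal sketch, assuming libfolder contains an __init__.py and the archive was built with something like zip -r libs.zip libfolder; the GCS path and function names are placeholders:

```python
# main.py -- placeholder paths and names throughout.
from pyspark import SparkContext

sc = SparkContext(appName="package-on-workers")

# Makes the archive importable on the driver and every executor.
sc.addPyFile("gs://my-bucket/libs.zip")  # placeholder location

from libfolder import lib1  # import after addPyFile so it can be resolved

print(sc.parallelize([1, 2, 3]).map(lib1.process).collect())  # placeholder function
```
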
7 votes, 3 answers

Error while running PySpark Dataproc job due to Python version

I create a Dataproc cluster using the following command:
gcloud dataproc clusters create datascience \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh \
However, when I submit my PySpark job I get the following…
Kassem Shehady • 1,117 • 10 • 22
7 votes, 1 answer

YARN applications cannot start when specifying YARN node labels

I'm trying to use YARN node labels to tag worker nodes, but when I run applications on YARN (Spark or a simple YARN app), those applications cannot start. With Spark, when specifying --conf spark.yarn.am.nodeLabelExpression="my-label", the job cannot…
norbjd • 6,496 • 3 • 25 • 56
7 votes, 4 answers

How to run Python 3 on Google's Dataproc PySpark

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of 2.7 by default. The best I've been able to find is adding these initialization commands. However, when I ssh…
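
The driver and executor interpreters must match, and on Dataproc they are normally pinned cluster-wide (for example by an initialization action, or via the PYSPARK_PYTHON / spark.pyspark.python settings) rather than from inside the job. A small diagnostic sketch to confirm which interpreter each side actually runs:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-python").getOrCreate()

# Interpreter running the driver process.
print("driver:", sys.version)

# Interpreter running the executors (collected from one task per partition).
executor_versions = (
    spark.sparkContext.parallelize(range(2), 2)
    .map(lambda _: sys.version)
    .distinct()
    .collect()
)
print("executors:", executor_versions)
```
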
7 votes, 2 answers

Request insufficient authentication scopes when running Spark job on Dataproc

I am trying to run the Spark job on the Google Dataproc cluster as
gcloud dataproc jobs submit hadoop --cluster \
    --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    --class org.apache.hadoop.examples.WordCount…
Vishal • 1,177 • 2 • 22 • 40
7 votes, 1 answer

How to get path to the uploaded file

I am running a Spark cluster on Google Cloud, and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the…
orestis • 712 • 7 • 20
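
Files shipped with a job (for instance through a --files flag at submit time, or sc.addFile in code) are copied into each node's working directory, and SparkFiles resolves the local path. A minimal sketch; only the file name Configuration.properties comes from the question, the rest is illustrative:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-shipped-config").getOrCreate()
sc = spark.sparkContext

# If the file was not passed at submit time, it can also be added here
# (placeholder GCS path).
# sc.addFile("gs://my-bucket/Configuration.properties")

# Local path on the driver (and on executors, inside tasks).
path = SparkFiles.get("Configuration.properties")
with open(path) as f:
    config_text = f.read()
print(config_text)
```
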
7 votes, 3 answers

Read from BigQuery into Spark in an efficient way?

When using the BigQuery connector to read data from BigQuery, I found that it first copies all data to Google Cloud Storage and then reads it in parallel into Spark, but when reading a big table the data-copying stage takes a very long time. So is…
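
The staging-through-GCS behaviour described above is how the older Hadoop-based BigQuery connector works; the spark-bigquery connector instead reads through the BigQuery Storage API, which avoids the copy stage. A hedged sketch, assuming that connector's jar is available on the cluster (for example added with --jars); the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-direct-read").getOrCreate()

# Reads directly from BigQuery storage, in parallel, without an export to GCS.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")  # placeholder table
    .load()
)
df.printSchema()
print(df.count())
```
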
7 votes, 1 answer

Connecting IPython notebook to spark master running in different machines

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem. I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the image jupyter/all-spark-notebook. I have…
6 votes, 1 answer

How does Spark (2.3 or newer) determine the number of tasks to read Hive table files in a GCS bucket or HDFS?

Input data: a Hive table (T) with 35 files (~1.5 GB each, SequenceFile); the files are in a GCS bucket; default fs.gs.block.size = ~128 MB; all other parameters are default. Experiment 1: create a Dataproc cluster with 2 workers (4 cores per worker) and run select…
dykw • 1,089 • 3 • 12 • 16
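
The number of scan tasks comes from the input splits computed by the table's Hadoop InputFormat, which depend mainly on the file sizes and the block size the filesystem reports (fs.gs.block.size for GCS), not on the worker core count. A quick diagnostic sketch for seeing the resulting partition count; only the table name T is taken from the question:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("inspect-hive-splits")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("T")  # Hive table from the question

# Number of tasks in the scan stage corresponds to the input splits computed
# from the files' sizes and the reported block size.
print("partitions:", df.rdd.getNumPartitions())
```
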