Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1136 questions
0 votes, 2 answers

Proxying Resource Manager in Google Dataproc

I've followed Google's instructions on this: gcloud compute ssh --zone=us-central1-b --ssh-flag="-D 8088" --ssh-flag="-N" --ssh-flag="-n" spark-test-m followed by /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome…
J.Fratzke • 1,067
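
For context, the full pattern from Google's cluster web-interfaces guide has two steps: an SSH SOCKS tunnel, then a Chrome instance routed through it that browses to the master's hostname (http://spark-test-m:8088), not localhost. A hedged sketch, scripted in Python for convenience — cluster name, zone, and port are taken from the question, the Chrome flags from the guide:

    # Hedged sketch: SSH SOCKS tunnel to the master plus a proxied Chrome.
    import subprocess

    MASTER = "spark-test-m"
    ZONE = "us-central1-b"
    SOCKS_PORT = 8088  # any free local port works

    # Step 1: dynamic port forwarding (-D) with no remote command (-N, -n).
    tunnel = subprocess.Popen([
        "gcloud", "compute", "ssh",
        "--zone=" + ZONE,
        "--ssh-flag=-D %d" % SOCKS_PORT,
        "--ssh-flag=-N",
        "--ssh-flag=-n",
        MASTER,
    ])

    # Step 2: a fresh Chrome profile that resolves hostnames through the
    # proxy, so http://spark-test-m:8088 reaches the ResourceManager.
    chrome = subprocess.Popen([
        "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
        "--proxy-server=socks5://localhost:%d" % SOCKS_PORT,
        "--host-resolver-rules=MAP * 0.0.0.0 , EXCLUDE localhost",
        "--user-data-dir=/tmp/dataproc-ui-profile",
    ])
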
0 votes, 1 answer

Scheduled MapReduce job on Google Cloud Platform

I'm developing a Node.js application that basically stores user event logs in a database and shows insights about user actions. To achieve this, the event logs must be analyzed by a MapReduce job which would run automatically once a day (every…
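
One common shape for this, sketched under assumptions (project, region, cluster, bucket, and jar names below are placeholders): keep or recreate a Dataproc cluster and let a daily cron task submit the MapReduce job through the Dataproc REST API with the google-api-python-client:

    # Hedged sketch: submit a Hadoop job to an existing cluster; a daily
    # cron entry (App Engine cron, crontab, etc.) would invoke this script.
    from googleapiclient.discovery import build

    PROJECT = "my-project"   # placeholder
    REGION = "global"        # Dataproc's default region at the time
    CLUSTER = "my-cluster"   # placeholder

    dataproc = build("dataproc", "v1")

    job_details = {
        "projectId": PROJECT,
        "job": {
            "placement": {"clusterName": CLUSTER},
            "hadoopJob": {
                "mainJarFileUri": "gs://my-bucket/jobs/analyze-logs.jar",
                "args": ["gs://my-bucket/logs/", "gs://my-bucket/insights/"],
            },
        },
    }

    result = dataproc.projects().regions().jobs().submit(
        projectId=PROJECT, region=REGION, body=job_details).execute()
    print(result["reference"]["jobId"])
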
0 votes, 0 answers

Error Installing Oozie on Dataproc

I was first using a Dataproc initialization script provided by Google (here) to install Oozie on a new cluster, and noticed that I couldn't hit the UI or run jobs on the command line. While diagnosing, I went ahead and deleted the cluster, then recreated a…
Khirok • 507
0 votes, 2 answers

Dataproc + Python package: Distribute updated versions

Currently I am developing a Spark application on Google Dataproc, and I frequently need to update the Python package. During provisioning I run the following commands: echo "Downloading and extracting source code..." gsutil cp…
Frank • 387
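
Rather than reprovisioning the cluster for every package update, one option (a minimal sketch; the bucket path and package name are hypothetical) is to upload the updated package zip to GCS and ship it with each job via SparkContext.addPyFile, so a gsutil cp plus a resubmit picks up the new version:

    # Hedged sketch: distribute an updated package zip at job-submit time.
    from pyspark import SparkContext

    sc = SparkContext()

    # addPyFile stages the archive on the driver and executors and puts it
    # on sys.path, so no cluster-level reinstall is needed.
    sc.addPyFile("gs://my-bucket/deps/mypkg.zip")  # hypothetical path

    import mypkg  # hypothetical package, importable after addPyFile

The same effect is available at submit time with gcloud dataproc jobs submit pyspark ... --py-files gs://my-bucket/deps/mypkg.zip.
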
0 votes, 1 answer

Apache Spark job runs locally but throwing null pointer on Google Cloud Cluster

I have an Apache Spark application that I have, until now, been running/testing on my local machine using the command: spark --class "main.SomeMainClass" --master local[4] jarfile.jar Everything runs fine; however, when I submit this very same job…
MichaelDD • 716
0 votes, 1 answer

Read a file in Spark jobs from Google Cloud Platform

I'm using Spark on Google Cloud Platform. I'm reading a file from the filesystem gs:///dir/file, but the log output reports FileNotFoundException: gs:/bucket/dir/file (No such file or directory). The missing / is obviously the…
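
For reference, a GCS URI carries the bucket name in its authority component, so the connector can only resolve gs://<bucket>/<path>; in gs:///dir/file the bucket is empty. A minimal sketch with a placeholder bucket name:

    # Minimal sketch: include the bucket name in the gs:// URI.
    from pyspark import SparkContext

    sc = SparkContext()
    rdd = sc.textFile("gs://my-bucket/dir/file")  # not gs:///dir/file
    print(rdd.count())
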
0 votes, 1 answer

Request had insufficient authentication scopes [403] when creating a cluster with Google Cloud Dataproc

In Google Cloud Platform the Dataproc API is enabled. I am using the same key I use to access GCS and BigQuery to create a new cluster per this example. I get a Request had insufficient authentication scopes error on the following line. Operation…
PUG • 3,867
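
This 403 typically means the credential was built without a scope that covers Dataproc. A hedged sketch (the key file path is a placeholder): request the broad cloud-platform scope, which covers GCS, BigQuery, and Dataproc alike, when constructing the client:

    # Hedged sketch: build the Dataproc client with an explicit scope.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    credentials = service_account.Credentials.from_service_account_file(
        "/path/to/key.json",  # placeholder
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    dataproc = build("dataproc", "v1", credentials=credentials)

Note that when the code runs on a GCE VM with default credentials, the VM's own access scopes can impose the same limit, so the instance may need to be recreated with broader scopes.
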
0 votes, 1 answer

Where does Google Dataproc store Spark logs on disk?

I'd like to get command-line access to the live logs produced by my Spark app when I'm SSH'd into the master node (the machine hosting the Spark driver program). I'm able to see them using gcloud dataproc jobs wait, the Dataproc web UI, and in GCS,…
Jon Chase • 413
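
An assumption worth verifying on your image version: Spark executor stdout/stderr end up in the YARN NodeManager container logs, which Dataproc keeps under /var/log/hadoop-yarn/userlogs on the worker nodes. A small speculative sketch for listing them on a node:

    # Speculative sketch: list per-container Spark logs on a Dataproc node.
    # The log root is an assumption; check yarn.nodemanager.log-dirs if empty.
    import pathlib

    LOG_ROOT = pathlib.Path("/var/log/hadoop-yarn/userlogs")

    for app_dir in sorted(LOG_ROOT.glob("application_*")):
        for log_file in sorted(app_dir.rglob("*")):
            if log_file.is_file():  # stdout/stderr per container
                print(log_file)
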
0 votes, 1 answer

What is the best way to minimize the initialization time for Apache Spark jobs on Google Dataproc?

I am trying to use a REST service to trigger Spark jobs via the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 seconds to initialize the Spark driver and submit the application. I am wondering if there is an effective way to…
pashupati • 87
0 votes, 0 answers

Dataproc failed to read Parquet file in Google Cloud Storage

I have a Parquet file in Google Cloud Storage and try to read it as below: val parquetFile = sqlContext.read.parquet("gs://eng_sandbox1/shaw/testparquet/part-r-00000-b4aecbee-724e-40ea-b868-95f7e3f758a7.gz.parquet") However, I encountered the…
Shaw Ou • 11
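
The error itself is cut off, but a common first check (sketched below with the paths from the question, so treat it as illustrative, not a confirmed fix) is to point the reader at the dataset directory rather than a single part file, and to confirm the cluster's service account can read the bucket:

    # Hedged sketch: read the whole Parquet dataset directory.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    df = sqlContext.read.parquet("gs://eng_sandbox1/shaw/testparquet/")
    df.printSchema()
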
0 votes, 1 answer

Google Dataproc API Spark cluster with C#

I have data in BigQuery that I want to run analytics on in a Spark cluster. Per the documentation, if I instantiate a Spark cluster it should come with a BigQuery connector. I was looking for sample code to do this and found one in PySpark. I could not find…
PUG • 3,867
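
There is no first-party C# Spark binding, so the usual route is to write the job itself in a JVM language or Python and drive it from C# through the Dataproc REST API. For reference, the PySpark sample the asker mentions uses the Hadoop BigQuery connector roughly like this (project and bucket names are placeholders; the input table is the public shakespeare sample):

    # Hedged sketch of the documented PySpark BigQuery-connector pattern.
    from pyspark import SparkContext

    sc = SparkContext()

    conf = {
        # The connector stages BigQuery exports through this GCS bucket.
        "mapred.bq.project.id": "my-project",
        "mapred.bq.gcs.bucket": "my-staging-bucket",
        "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq_tmp/",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
    print(table_data.take(1))
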
0 votes, 1 answer

Dataproc MapReduce stopped working

I run the standard HBase class for counting rows (RowCounter) on a Bigtable table. The Dataproc GUI in the Google Console is used. It worked fine, but after a few weeks I tried to run a similar jar and the job fails for a hardly explainable reason. This doesn't look like…
Daneel Yaitskov • 4,019
0 votes, 1 answer

Google Cloud Dataproc - job file erroring on sc.textFile() command

Here is the file that I submit as a PySpark job in Dataproc, through the UI: # Load file data from Google Cloud Storage to the Dataproc cluster, creating an RDD # Because Spark transforms are 'lazy', we do a 'count()' action to make sure # we…
Thom Rogers • 1,137
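
For comparison, a complete, minimal version of that job file (the bucket and path are placeholders). A PySpark file submitted to Dataproc has to create its own SparkContext:

    # Minimal sketch of the pattern described in the question.
    from pyspark import SparkContext

    sc = SparkContext()

    # Load file data from Google Cloud Storage into an RDD. Because Spark
    # transforms are lazy, run a count() action to force the read and
    # surface any path or permission errors immediately.
    rdd = sc.textFile("gs://my-bucket/data/input.txt")
    print(rdd.count())
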
0 votes, 1 answer

exec sh from PySpark

I'm trying to run a .sh file, loaded from a .py file, in a PySpark job, but I always receive a message saying that the .sh file is not found. This is my code: test.py: import os,sys os.system("sh ./check.sh") and my gcloud command: gcloud beta dataproc…
sergio • 21
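
The working directory of a submitted job generally doesn't contain check.sh, so ./check.sh isn't found. A hedged sketch: ship the script with the job (gcloud dataproc jobs submit pyspark test.py --files check.sh) and resolve the staged copy through SparkFiles instead of a relative path:

    # Hedged sketch: run a shell script shipped alongside the PySpark job.
    import subprocess

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext()

    script_path = SparkFiles.get("check.sh")  # absolute path to staged copy
    subprocess.check_call(["sh", script_path])
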
0 votes, 1 answer

Accessing data in Google storage for Apache Spark SQL

I have about 30 GB of data in Cloud Storage that I would like to query using Apache Hive from a Dataproc cluster. What's the best strategy for accessing this data? Is the best approach to copy the data to my master via gsutil and access it from…
femibyte • 2,585
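
Copying to the master is usually unnecessary: the GCS connector exposes gs:// as a Hadoop filesystem, so Hive can address the data in place through an external table. A sketch with hypothetical schema, table, and bucket names, using a HiveContext from PySpark:

    # Hedged sketch: query GCS-resident data in place via an external table.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hc = HiveContext(sc)

    hc.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
            user_id STRING,
            action  STRING,
            ts      BIGINT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 'gs://my-bucket/events/'
    """)

    hc.sql("SELECT action, COUNT(*) FROM events GROUP BY action").show()
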