Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying/managing clusters and submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1136 questions
0 votes • 1 answer

Dataproc : Submit a Spark Job through REST API

We are using Google Cloud Platform for big-data analytics. For processing we are currently using Google Cloud Dataproc and Spark Streaming. I want to submit a Spark job using the REST API, but when I am calling the URI with the api-key, I am getting…
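The excerpt is cut off, but a frequent cause of auth failures here is that the Dataproc REST API expects an OAuth 2.0 bearer token rather than an API key. A minimal sketch of the `jobs:submit` call (project, cluster, and jar names are made-up placeholders):

```python
import json
import urllib.request

def build_submit_request(project, region, cluster, main_class, jar_uri):
    """Build the URL and JSON body for a Dataproc jobs:submit call."""
    url = (f"https://dataproc.googleapis.com/v1/projects/{project}"
           f"/regions/{region}/jobs:submit")
    body = {
        "job": {
            "placement": {"clusterName": cluster},
            "sparkJob": {
                "mainClass": main_class,
                "jarFileUris": [jar_uri],
            },
        }
    }
    return url, body

def submit(url, body, access_token):
    # The token can come from `gcloud auth print-access-token`;
    # an API key alone is not accepted by the Dataproc API.
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {access_token}",
                 "Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

The same request body works from any HTTP client; only the `Authorization` header differs from an API-key call.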
0 votes • 1 answer

Performance monitoring for Google Cloud DataProc

We are using Google Cloud Platform for big-data analytics. For processing we are currently using Google Cloud Dataproc and Spark Streaming. We would like to check job performance using some monitoring utilities like Ganglia, Graphite,…
0 votes • 0 answers

Spark Tasks not evenly distributed among executors (google cloud dataproc)

I noticed that after a repartition the tasks do not always get evenly distributed among the executors. This causes an enormous buildup. The repartition function randomly assigns a partition number for each item. It seems that the tasks are quite…
asked by bjorndv (413 rep)
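`repartition(n)` assigns each record a pseudo-random partition, so persistent imbalance usually points to key skew rather than the shuffle itself. A common remedy (a generic Spark technique, not specific to this question) is key salting; the bucket logic can be sketched in plain Python:

```python
import random
from collections import Counter

def salt_key(key, num_salts, rng):
    # Append a random suffix so a single hot key spreads over num_salts buckets.
    return (key, rng.randrange(num_salts))

rng = random.Random(0)                # fixed seed for a repeatable demo
records = ["hot"] * 1000              # one dominant key -> skew
buckets = Counter(salt_key(r, 8, rng) for r in records)
# In Spark this corresponds to something like:
#   rdd.map(lambda r: (salt_key(r, 8, rng), r)).partitionBy(8, lambda k: k[1])
# followed by a second pass that strips the salt and merges partial results.
```

The hot key now occupies eight buckets instead of one, at the cost of an extra aggregation step.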
0 votes • 1 answer

Dataproc Cluster with Spark 1.6.X using scala 2.11.X

I'm looking for a way to use Spark on Dataproc built with Scala 2.11. I want to use 2.11 since my jobs pull in ~10 BigQuery tables and I'm using the new reflection libraries to map the corresponding objects to case classes. (There's a bug with the…
asked by J.Fratzke (1,067 rep)
0 votes • 1 answer

How can we deploy an existing Kafka - Spark - Cassandra project as Kafka - Dataproc - Cassandra on Google Cloud Platform?

My existing project is Kafka-Spark-Cassandra. Now I have a GCP account and have to migrate my Spark jobs to Dataproc. In my existing Spark jobs, parameters like master IP, memory, cores, etc. are passed through the command line, which is triggered by a Linux…
0 votes • 1 answer

Outputting billions of lines from Spark

I'm trying to output an RDD that has ~5,000,000 rows as a text file using PySpark. It's taking a long time, so what are some tips on how to make the .saveAsTextFile() faster? The rows are 3 columns each, and I'm saving to HDFS.
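`saveAsTextFile()` writes one file per partition, so both very low and very high partition counts hurt throughput. A rough sizing rule (my assumption: aim for ~128 MB per output file) for choosing `n` before `rdd.repartition(n).saveAsTextFile(path)`:

```python
def target_partitions(total_bytes, target_mb=128):
    """Rough partition count so each output file lands near target_mb.
    Ceiling division keeps at least one partition."""
    target = target_mb * 1024 * 1024
    return max(1, -(-total_bytes // target))

# A 10 GiB RDD would be written as ~80 files of ~128 MB each.
```

Enabling output compression (e.g. passing a codec class to `saveAsTextFile`) can also cut write time when HDFS I/O is the bottleneck.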
0 votes • 1 answer

pyspark failed in google dataproc

My job failed with the following logs; however, I don't fully understand them. The failure seems to be caused by "YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 24.7 GB of 24 GB physical…". But how can I…
asked by Hang (1 rep)
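That YARN message usually means off-heap usage (Netty buffers, Python workers, etc.) pushed the container above executor memory plus `spark.yarn.executor.memoryOverhead`. In Spark 1.x that overhead defaults to max(384 MB, 10% of executor memory), and raising the overhead, rather than executor memory, is the usual fix. A sketch of the default:

```python
def default_memory_overhead_mb(executor_memory_mb):
    """Spark 1.x default for spark.yarn.executor.memoryOverhead:
    max(384 MB, 10% of spark.executor.memory)."""
    return max(384, int(executor_memory_mb * 0.10))

# A 24 GiB executor gets ~2457 MB of overhead by default. If containers are
# still killed, an explicit override on job submission, e.g.
#   --properties spark.yarn.executor.memoryOverhead=4096
# (value in MB) gives YARN headroom without shrinking the heap.
```

The `4096` value above is illustrative; the right number depends on how much off-heap memory the job actually uses.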
0 votes • 1 answer

Google Cloud Spark ElasticSearch TransportClient connection exception

I am using Spark on Google Cloud and I have the following code to connect to an Elasticsearch database import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.transport.TransportClient; import…
asked by orestis (712 rep)
0 votes • 1 answer

Dataproc bdutil versioning

Is it possible to set the Hadoop cluster image version using the bdutil command-line tool? Using the web UI console or gcloud it is possible to choose image version 1.0, which supports Hadoop 2.x and Hive 1.2. In contrast, using bdutil, according to the…
0 votes • 1 answer

Google Dataproc and BigQuery integration with custom query

I am running a Spark cluster using Google Dataproc. I would like to get data from BigQuery using a custom query. I am able to run the basic word-count example, but I am looking for a way to run a custom query, e.g. SELECT ROW_NUMBER() OVER() as Id, prop11…
asked by gana (165 rep)
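The Hadoop BigQuery connector of this era reads whole tables rather than query results, so the common workaround is to materialize the query into a temporary table first and point the connector at that. A sketch of the `jobs.insert` request body (project/dataset/table names are placeholders; I'm assuming the BigQuery REST v2 shape):

```python
def build_query_job(project, dataset, table, sql):
    """BigQuery jobs.insert body that writes a query's result into a
    destination table, which the Spark/Hadoop connector can then read."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "allowLargeResults": True,
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }

# job = build_query_job("my-project", "tmp", "numbered_rows",
#                       "SELECT ROW_NUMBER() OVER() AS Id, prop11 FROM ...")
```

`allowLargeResults` requires a destination table anyway, so this pattern covers big result sets as well.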
0 votes • 1 answer

Google Dataflow vs Apache Spark Streaming (either on Google Cloud or with Google Dataproc)

I am new to cloud and big data but have much interest in them, and I have significant experience in Java programming. I am currently working on my uni project comparing the performance of Apache Spark Streaming with Google Cloud Dataflow. I…
0 votes • 2 answers

Apache Mahout on Dataproc?

Is Apache Mahout (https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) available on Google Dataproc?
0 votes • 2 answers

How to access Cloud SQL from dataproc?

I have a dataproc cluster and I'd like to have the cluster access a Cloud SQL instance. When I created the cluster I assigned scope --scopes sql-admin but after reading the Cloud SQL documentation it looks like I need to connect through a proxy. How…
asked by sthomps (3,458 rep)
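The usual pattern is to run the Cloud SQL proxy on every node (often installed through an initialization action) and let it listen on localhost, so jobs connect to 127.0.0.1 rather than the instance's IP. A small sketch of the resulting connection string (the default MySQL port is my assumption):

```python
def proxy_jdbc_url(database, host="127.0.0.1", port=3306):
    """JDBC URL for MySQL reached through a locally running Cloud SQL proxy.
    With the proxy on every node, jobs never dial the instance IP directly."""
    return f"jdbc:mysql://{host}:{port}/{database}"
```

The proxy itself is started with an `-instances=<project>:<region>:<instance>=tcp:3306`-style flag; the exact invocation depends on the proxy version in use.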
0 votes • 1 answer

Why does Google Dataproc not pull coreNLP jars although they are included in the POM file?

My application is a java maven project that uses Spark. Here's the section in my pom that adds stanford coreNLP dependencies: edu.stanford.nlp stanford-corenlp
asked by Kai (1,275 rep)
0 votes • 1 answer

Google Cloud Sdk from DataProc Cluster

What is the right way to use/install Python Google Cloud APIs such as Pub/Sub from a Google Dataproc cluster? For example, if I'm using Zeppelin/PySpark on the cluster and I want to use the Pub/Sub API, how should I prepare it? It is unclear to me…
asked by ismisesisko (141 rep)