Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

70 questions
0 votes, 2 answers

Hive INSERT OVERWRITE to Google Storage as LOCAL DIRECTORY not working

I use the following Hive query: hive> INSERT OVERWRITE LOCAL DIRECTORY "gs://Google/Storage/Directory/Path/Name" row format delimited fields terminated by ',' select * from .; I am getting the following…
Sujoy • 1 • 1
0 votes, 3 answers

Job tracking URL in Google Compute Engine not working

I am using Google Compute Engine to run MapReduce jobs on Hadoop (pretty much all default configs). While running the job I get a tracking URL of the form http://PROJECT_NAME:8088/proxy/application_X_Y/ but it fails to open. Did I forget to…
0 votes, 1 answer

Spark 1.4 image for Google Cloud?

With bdutil, the latest version of the tarball I can find is for Spark 1.3.1: gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance the Spark 1.4 image will be available for bdutil, or…
Haiying Wang • 622 • 6 • 10
0 votes, 1 answer

How can I use GCP free credit to deploy Hadoop?

How can I use the Google Cloud Platform free trial to test a Hadoop cluster? What are the most important things I should keep in mind if I try this? Will I be charged during the free Google Cloud Platform trial?
James • 2,181 • 11 • 26
0 votes, 1 answer

Deleted Google Storage directory appears to "already exist" when calling Spark DataFrame.saveAsParquetFile()

After I deleted a Google Cloud Storage directory through the Google Cloud Console (the directory had been generated by an earlier Spark (ver 1.3.1) job), re-running the job always fails, and the directory still seems to exist as far as the job is concerned; I cannot find…
Haiying Wang • 622 • 6 • 10
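
A common workaround for the situation in this question is to remove the stale output path through the Hadoop FileSystem API before re-running the job. The following is a minimal sketch, not the asker's code: it assumes the GCS connector is on the classpath and registered for the gs:// scheme, and the bucket and path names are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanStaleOutput {
    public static void main(String[] args) throws Exception {
        // Placeholder path for the directory the earlier Spark job created.
        Path output = new Path("gs://my-bucket/output/parquet");

        // Loads core-site.xml; the GCS connector must be configured there
        // (fs.gs.impl) for the gs:// scheme to resolve.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("gs://my-bucket"), conf);

        // Recursively delete the stale directory so the job can recreate it.
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
    }
}
```
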
0 votes, 1 answer

How to create a directory in HDFS on Google Cloud Platform via Java API

I am running a Hadoop cluster on Google Cloud Platform, using Google Cloud Storage as the backend for persistent data. I am able to ssh to the master node from a remote machine and run hadoop fs commands. However, when I try to execute the following code…
gl051 • 561 • 4 • 8
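
As a point of reference for the question above, creating a directory from a Java client boils down to pointing a Hadoop Configuration at the cluster's default file system and calling FileSystem.mkdirs. This is a minimal sketch under that assumption; the hostname, port, and target path are placeholders, and the client must be able to reach the namenode over the network.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirOnCluster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder: point at the cluster's namenode address.
        conf.set("fs.defaultFS", "hdfs://master-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/example/new-dir");  // placeholder target path
        boolean created = fs.mkdirs(dir);
        System.out.println("mkdirs returned " + created);
    }
}
```
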
0 votes, 1 answer

Spark/Hadoop/YARN cluster communication requires external IP?

I deployed Spark (1.3.1) with yarn-client on a Hadoop (2.6) cluster using bdutil; by default, the instances are created with ephemeral external IPs, and so far Spark works fine. With some security concerns, and assuming the cluster is internal…
Haiying Wang • 622 • 6 • 10
0 votes, 1 answer

Map tasks with input from Cloud Storage use only one worker

I am trying to use a file from Google Cloud Storage via FileInputFormat as input for a MapReduce job. The file is in Avro format. As a simple test, I deployed a small Hadoop2 cluster with the bdutil tool, consisting of the master and two worker…
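
One plausible cause for the behavior described in this question is that the whole input resolves to a single split, so only one map task (and therefore one worker) runs. Below is a minimal sketch of requesting smaller splits through FileInputFormat with the new MapReduce API; the path and split sizes are illustrative placeholders, not a confirmed fix for the asker's setup.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("avro-from-gcs");  // placeholder job name

        // Placeholder input path on Cloud Storage.
        FileInputFormat.addInputPath(job, new Path("gs://my-bucket/input/data.avro"));

        // Ask for smaller splits so a single large file can fan out to
        // several mappers (values are in bytes and purely illustrative).
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```
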
0 votes, 1 answer

Multiple Hadoop clusters in one Google Cloud project

Is it possible to deploy several Hadoop clusters in one Google Cloud project?
Evgeny Timoshenko • 2,561 • 4 • 29 • 49
0 votes, 2 answers

Map Only MapReduce Job with BigQuery

We have a MapReduce job created to inject data into BigQuery. There is not much filtering in our job, so we'd like to make it a map-only job to make it faster and more efficient. However, the Java class "com.google.gson.JsonObject"…
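
Independent of the gson class issue the excerpt trails off into, making a job map-only simply means setting the reducer count to zero. A minimal sketch with the new MapReduce API follows; the job name is a placeholder, and the rest of the job configuration (input/output formats, BigQuery connector settings) is omitted.

```java
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("bigquery-ingest-map-only");  // placeholder name

        // With zero reduce tasks the job becomes map-only: each mapper's
        // output is written directly by the configured OutputFormat.
        job.setNumReduceTasks(0);
    }
}
```
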
0 votes, 1 answer

bdutil: How to launch a Hadoop cluster with a requested image id? (Ubuntu 12.04)

When I attempt to launch a Hadoop cluster with the bdutil command, using one of the following: bdutil -b a_hadoop_test -n 1 -P mycluster -e hadoop2_env.sh -i ubuntu-1204 deploy OR bdutil -b a_hadoop_test -n 1 -P mycluster -e hadoop2_env.sh -i…
0 votes, 1 answer

How to force bdutil command to run as root?

I am starting a Google Compute Engine VM from an App Engine application. The start-up scripts for the GCE VM run Python scripts which, in turn, make os.system calls to bdutil commands, e.g., os.system("bdutil --bucket --num_workers 1 " …
0 votes, 1 answer

Spark SQL on Google Compute Engine issue

We are using bdutil 1.1 to deploy a Spark (1.2.0) cluster. However, we are having an issue when we launch our spark script: py4j.protocol.Py4JJavaError: An error occurred while calling o70.registerTempTable. : java.lang.RuntimeException:…
0 votes, 1 answer

Error when running Spark on a Google Cloud instance

I'm running a standalone application using Apache Spark, and when I load all my data into an RDD as a textfile I get the following error: 15/02/27 20:34:40 ERROR Utils: Uncaught exception in thread stdout writer for python java.lang.OutOfMemoryError:…
Saulo Ricci • 698 • 1 • 8 • 23
0 votes, 1 answer

JobTracker - High memory and native thread usage

We are running Hadoop on GCE with HDFS as the default file system, and data input/output from/to GCS. Hadoop version: 1.2.1 Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1 Observed behavior: JT will accumulate threads in waiting…