Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

70 questions
11
votes
1 answer

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input / Output…
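
A minimal sketch of the pattern this question asks about, assuming an existing SparkContext sc, the BigQuery connector jar on the classpath, and placeholder project, bucket, and table names:

    import json

    # Hadoop InputFormat configuration for the BigQuery connector;
    # every project, bucket, and table ID below is a placeholder.
    conf = {
        "mapred.bq.project.id": "my-project",
        "mapred.bq.gcs.bucket": "my-bucket",
        "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq-export",
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }

    # Each record arrives as a (key, JSON string) pair.
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)

    rows = table_data.map(lambda record: json.loads(record[1]))
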
10
votes
3 answers

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I am getting the following error: 5932 [main] ERROR…
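
The usual cause of this error is that the gs:// scheme is not registered with Hadoop on the local machine. A hedged sketch of the common fix, assuming the gcs-connector jar has been added to the classpath (e.g. via --jars):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "gs-demo")

    # Map the gs:// scheme to the GCS connector classes; the connector
    # jar itself must also be on the driver and executor classpath.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("fs.gs.impl",
              "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hconf.set("fs.AbstractFileSystem.gs.impl",
              "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

    rdd = sc.textFile("gs://mybucket/folder")  # bucket/path from the question
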
7
votes
3 answers

Read from BigQuery into Spark in efficient way?

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. When reading a big table, the copying stage takes a very long time. So is…
6
votes
2 answers

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage. I have explored gsutil, which seems to be the recommended option for moving big datasets to GCS and appears to handle huge ones. It seems though…
obaid
  • 265
  • 4
  • 12
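
For data that is already in HDFS, a common alternative to gsutil is to run the copy as a distributed job with distcp once the GCS connector is installed on the cluster; a sketch with placeholder host and paths:

    # Run on the source cluster; namenode host and paths are placeholders.
    hadoop distcp hdfs://namenode:8020/data/json gs://my-bucket/data/json
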
5
votes
1 answer

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of GRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than are currently provided on the Dataproc 1.1 image. When running the project…
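
One common workaround for this class of conflict (an assumption here, not necessarily the asker's eventual solution) is to shade the newer libraries into the job jar so they cannot collide with the versions bundled on the image; a maven-shade-plugin sketch with example package prefixes:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- Move the job's Guava and Protobuf under a private
                   prefix so they cannot clash with the image's copies. -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>repackaged.com.google.common</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
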
5
votes
3 answers

Hadoop cannot connect to Google Cloud Storage

I'm trying to connect Hadoop running on a Google Cloud VM to Google Cloud Storage. I have: modified core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties; downloaded and referenced…
Denny Lee
  • 2,766
  • 1
  • 16
  • 31
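
For reference, a core-site.xml sketch of the properties involved, the persistent equivalent of the runtime settings shown earlier; the property names match the GCS connector's documentation, while the project ID value is a placeholder:

    <!-- core-site.xml: map the gs:// scheme to the GCS connector -->
    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>
    <property>
      <name>fs.gs.project.id</name>
      <value>my-project-id</value> <!-- placeholder -->
    </property>
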
5
votes
1 answer

Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only. What I am doing: deploy a cluster with bdutil deploy -e datastore_env.sh. On the…
4
votes
2 answers

Rate limit with Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows: java.io.IOException: Error inserting: bucket: *****, object: ***** at…
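
Errors like this usually indicate too many concurrent writes against the same bucket. One mitigation, offered as a hedged suggestion rather than the connector's documented fix, is to cut the number of simultaneous writers before saving:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rate-limit-demo")
    result = sc.parallelize(range(100000))

    # Fewer output partitions means fewer tasks writing to the bucket at
    # once, lowering the request rate; 32 and the path are placeholders.
    result.map(str).coalesce(32).saveAsTextFile("gs://my-bucket/output")
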
3
votes
1 answer

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading and setting SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz', deployment with bdutil was fine; however, when trying to call…
Haiying Wang
  • 622
  • 6
  • 10
3
votes
1 answer

SparkR collect method crashes with OutOfMemory on Java heap space

For a PoC with SparkR, I'm trying to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, is bdutil-deployed, and is composed of 1 master and 2 workers with 15 GB of RAM and 4…
Gouffe
  • 161
  • 1
  • 10
2
votes
1 answer

How to speed up distcp when transferring data from Hadoop to Google Cloud Storage

Google Cloud provides connectors for working with Hadoop (https://cloud.google.com/hadoop/google-cloud-storage-connector). Using the connector, I transfer data from HDFS to Google Cloud Storage, e.g. hadoop distcp hdfs://${path} gs://${path}, but…
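
For the speed question, the knob that usually matters is the number of copy maps; a sketch with a placeholder value, raising the map count from distcp's default of 20:

    # -m sets the number of parallel copy tasks; 100 is a placeholder
    # to tune against cluster size and bucket request limits.
    hadoop distcp -m 100 hdfs://${path} gs://${path}
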
2
votes
1 answer

Accessing google cloud storage using hadoop FileSystem api

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogue from…
Alvin C
  • 47
  • 6
2
votes
1 answer

Hive cross join fails on local map join

Is there a direct way to address the following error, or overall a better way to use Hive to get the join that I need? Output to a stored table isn't a requirement, as I can be content with an INSERT OVERWRITE LOCAL DIRECTORY to a csv. I am trying to…
Brett Bonner
  • 395
  • 2
  • 17
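
When the failure is in the local map-join task, one frequently suggested workaround (an assumption about this particular error, not a confirmed diagnosis) is to stop Hive from converting the join to a map join at all:

    -- Run the cross join as a regular shuffle join instead of a
    -- local map-join task.
    SET hive.auto.convert.join=false;
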
2
votes
1 answer

Accessing Google Storage with SparkR on bdutil deployed cluster

I've been using bdutil for a year now, with Hadoop and Spark, and it works quite well! Now I've got a little problem trying to get SparkR to work with Google Storage as HDFS. Here is my setup: bdutil 1.2.1; I have deployed a cluster with 1…
Gouffe
  • 161
  • 1
  • 10
2
votes
2 answers

Where is the source of datastore-connector-latest.jar? Could I add this as a maven dependency?

I got the connectors from https://cloud.google.com/hadoop/datastore-connector, but I'm trying to add the datastore-connector (and the bigquery-connector too) as a dependency in the pom... I don't know if this is possible. I could not find the right…
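
The GCS and BigQuery connectors are published on Maven Central under the com.google.cloud.bigdataoss group, so those two can be declared as ordinary dependencies; the version below is a placeholder, and whether datastore-connector is published there as well is worth verifying:

    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>bigquery-connector</artifactId>
      <!-- placeholder version; pick the build matching your Hadoop line -->
      <version>0.10.2-hadoop2</version>
    </dependency>
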