Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

70 questions
11
votes
1 answer

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input / Output…
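
A minimal sketch of the pattern this question asks about, assuming an existing SparkContext sc, the BigQuery connector jar on the classpath, and placeholder project, bucket, and table names:

    import json

    # Hadoop InputFormat configuration for the BigQuery connector;
    # every project, bucket, and table ID below is a placeholder.
    conf = {
        "mapred.bq.project.id": "my-project",
        "mapred.bq.gcs.bucket": "my-bucket",
        "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq-export",
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }

    # Each record arrives as a (key, JSON string) pair.
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)

    rows = table_data.map(lambda record: json.loads(record[1]))
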
10
votes
3 answers

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I am getting the following error: 5932 [main] ERROR…
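
The usual cause of this error is that the gs:// scheme is not registered with Hadoop on the local machine. A hedged sketch of the common fix, assuming the gcs-connector jar has been added to the classpath (e.g. via --jars):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "gs-demo")

    # Map the gs:// scheme to the GCS connector classes; the connector
    # jar itself must also be on the driver and executor classpath.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("fs.gs.impl",
              "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hconf.set("fs.AbstractFileSystem.gs.impl",
              "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

    rdd = sc.textFile("gs://mybucket/folder")  # bucket/path from the question
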
7
votes
3 answers

Read from BigQuery into Spark in efficient way?

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. When reading a big table, the copying stage takes a very long time. So is…
6
votes
2 answers

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage. I have explored gsutil, which seems to be the recommended option for moving big datasets to GCS and appears to handle huge ones. It seems though…
obaid
  • 265
  • 4
  • 12
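
For data that is already in HDFS, a common alternative to gsutil is to run the copy as a distributed job with distcp once the GCS connector is installed on the cluster; a sketch with placeholder host and paths:

    # Run on the source cluster; namenode host and paths are placeholders.
    hadoop distcp hdfs://namenode:8020/data/json gs://my-bucket/data/json
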
5
votes
1 answer

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of GRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than are currently provided on the Dataproc 1.1 image. When running the project…
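
One common workaround for this class of conflict (an assumption here, not necessarily the asker's eventual solution) is to shade the newer libraries into the job jar so they cannot collide with the versions bundled on the image; a maven-shade-plugin sketch with example package prefixes:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- Move the job's Guava and Protobuf under a private
                   prefix so they cannot clash with the image's copies. -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>repackaged.com.google.common</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
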
5
votes
3 answers

Hadoop cannot connect to Google Cloud Storage

I'm trying to connect Hadoop running on a Google Cloud VM to Google Cloud Storage. I have: modified core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties; downloaded and referenced…
Denny Lee
  • 2,766
  • 1
  • 16
  • 31
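
For reference, a core-site.xml sketch of the properties involved, the persistent equivalent of the runtime settings shown earlier; the property names match the GCS connector's documentation, while the project ID value is a placeholder:

    <!-- core-site.xml: map the gs:// scheme to the GCS connector -->
    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>
    <property>
      <name>fs.gs.project.id</name>
      <value>my-project-id</value> <!-- placeholder -->
    </property>
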
5
votes
1 answer

Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only. What I am doing: deploy a cluster with bdutil deploy -e datastore_env.sh. On the…
4
votes
2 answers

Rate limit with Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows: java.io.IOException: Error inserting: bucket: *****, object: ***** at…
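
Errors like this usually indicate too many concurrent writes against the same bucket. One mitigation, offered as a hedged suggestion rather than the connector's documented fix, is to cut the number of simultaneous writers before saving:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rate-limit-demo")
    result = sc.parallelize(range(100000))

    # Fewer output partitions means fewer tasks writing to the bucket at
    # once, lowering the request rate; 32 and the path are placeholders.
    result.map(str).coalesce(32).saveAsTextFile("gs://my-bucket/output")
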
3
votes
1 answer

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading and setting SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz', deployment with bdutil was fine; however, when trying to call…
Haiying Wang
  • 622
  • 6
  • 10
3
votes
1 answer

SparkR collect method crashes with OutOfMemory on Java heap space

For a PoC with SparkR, I'm trying to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, is bdutil-deployed, and is composed of 1 master and 2 workers with 15 GB of RAM and 4…
Gouffe
  • 161
  • 1
  • 10
2
votes
1 answer

How to speed up distcp when transferring data from Hadoop to Google Cloud Storage

Google Cloud provides connectors for working with Hadoop (https://cloud.google.com/hadoop/google-cloud-storage-connector). Using the connector, I transfer data from HDFS to Google Cloud Storage, e.g. hadoop distcp hdfs://${path} gs://${path}, but…
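
For the speed question, the knob that usually matters is the number of copy maps; a sketch with a placeholder value, raising the map count from distcp's default of 20:

    # -m sets the number of parallel copy tasks; 100 is a placeholder
    # to tune against cluster size and bucket request limits.
    hadoop distcp -m 100 hdfs://${path} gs://${path}
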
2
votes
1 answer

Accessing google cloud storage using hadoop FileSystem api

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogue from…
Alvin C
  • 47
  • 6
2
votes
1 answer

Hive cross join fails on local map join

Is there a direct way to address the following error, or overall a better way to use Hive to get the join that I need? Output to a stored table isn't a requirement, as I can be content with an INSERT OVERWRITE LOCAL DIRECTORY to a csv. I am trying to…
Brett Bonner
  • 395
  • 2
  • 17
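
When the failure is in the local map-join task, one frequently suggested workaround (an assumption about this particular error, not a confirmed diagnosis) is to stop Hive from converting the join to a map join at all:

    -- Run the cross join as a regular shuffle join instead of a
    -- local map-join task.
    SET hive.auto.convert.join=false;
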
2
votes
1 answer

Accessing Google Storage with SparkR on bdutil deployed cluster

I've been using bdutil for a year now, with Hadoop and Spark, and it works quite well! Now I've got a little problem trying to get SparkR to work with Google Storage as HDFS. Here is my setup: bdutil 1.2.1; I have deployed a cluster with 1…
Gouffe
  • 161
  • 1
  • 10
2
votes
2 answers

Where is the source of datastore-connector-latest.jar? Could I add this as a maven dependency?

I got the connectors from https://cloud.google.com/hadoop/datastore-connector, but I'm trying to add the datastore-connector (and the bigquery-connector too) as a dependency in the pom... I don't know if this is possible. I could not find the right…
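
The GCS and BigQuery connectors are published on Maven Central under the com.google.cloud.bigdataoss group, so those two can be declared as ordinary dependencies; the version below is a placeholder, and whether datastore-connector is published there as well is worth verifying:

    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>bigquery-connector</artifactId>
      <!-- placeholder version; pick the build matching your Hadoop line -->
      <version>0.10.2-hadoop2</version>
    </dependency>
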