10

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).

When running the job locally on my Mac machine, I am getting the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs:// paths to be supported. One is to install the GCS connector, and the other is to add the following setup to core-site.xml of the Hadoop installation:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>

I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project, I am using Maven, so I imported the Spark library as follows:

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and Hadoop 1.2.1 as follows:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>

The thing is, I am not sure where the Hadoop location is configured for Spark, or where the Hadoop conf directory is configured. Therefore, I may be adding the properties to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I can see, there is no Hadoop service running on my machine.

Yaniv Donenfeld

3 Answers

6

In Scala, add the following config when setting your hadoopConfiguration:

val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
Art Haedike
  • Very elegant. You will probably have to include some [adequate dependency](https://mavenjars.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-2.0.0) for the latest google cloud storage connector to make that possible. – matanster Apr 29 '20 at 07:54
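
Building on that comment, a hedged sketch of the Maven coordinates for the connector; the version shown is the hadoop3 build from the linked page and is an assumption, so pick the build that matches your Hadoop line:

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.0.0</version>
</dependency>
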
2

There are a couple of ways to help Spark pick up the relevant Hadoop configuration, both involving modifying ${SPARK_INSTALL_DIR}/conf:

  1. Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:

    ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
    

Older Spark docs explain that this causes the XML files to be included in Spark's classpath automatically: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html

  2. Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:

    export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
    

Newer Spark docs seem to indicate that this is the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html

Dennis Huo
  • But what is the Spark install dir when I use the Spark Maven component? – Yaniv Donenfeld Jan 07 '15 at 08:46
  • Ah, I see, if you're running straight out of your Maven project, you actually just need to make the core-site.xml (and probably also hdfs-site.xml) available in the classpath as mentioned elsewhere through the normal Maven means, namely by adding the two files to your `src/main/resources` directory. Edit: Pressed enter too early, here's a link to a blog post describing the similar case of Hadoop-only configuration with Maven: http://jayunit100.blogspot.com/2013/06/setting-hadoop-configuration-at-runtime.html – Dennis Huo Jan 07 '15 at 18:16
  • After adding the core-site.xml/hdfs-site.xml to the classpath, now I get the following error upon doing sc = new JavaSparkContext(conf); - java.lang.ClassNotFoundException: org.apache.hadoop.fs.LocalFileSystem. I am getting this, even though I have hadoop-core.jar version 1.2.1 in my classpath. – Yaniv Donenfeld Jan 08 '15 at 09:39
  • If you're running using `mvn exec:java` then indeed you'd expect the dependencies to be correctly present, but if you're doing `mvn package` and just running the jarfile, you have to explicitly ensure the right dependencies on your classpath. Commonly, you may want to build an "uberjar" which bundles all the transitive dependencies into a single jar that can be run without having to deal with classpaths. See this page: http://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html - the second example is similar to what you need, you can try copy/pasting into your pom.xml – Dennis Huo Jan 08 '15 at 17:44
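
For reference, a minimal sketch of the shade-plugin setup described in that last comment (the plugin version is an assumption; the linked page shows how to add include/exclude filters on top of this):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
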
1

I can't say what's wrong, but here's what I would try.

  • Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
  • Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executors: val x = sc.hadoopConfiguration.get("fs.gs.impl"); println(x); rdd.foreachPartition { _ => println(x) } (a fuller sketch follows below).
  • Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.

Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.
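
A minimal sketch of the diagnostic from the second bullet above, assuming an existing SparkContext sc and some RDD rdd; building a fresh Configuration inside the task shows what the executor's classpath actually provides:

import org.apache.hadoop.conf.Configuration

// Driver side: confirms whether core-site.xml (or a programmatic conf.set) was picked up.
println("driver fs.gs.impl = " + sc.hadoopConfiguration.get("fs.gs.impl"))

// Executor side: a fresh Configuration is loaded from whatever core-site.xml
// is on the executor's classpath (in local mode this is the same JVM).
rdd.foreachPartition { _ =>
  println("executor fs.gs.impl = " + new Configuration().get("fs.gs.impl"))
}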

Daniel Darabos
  • I tried your suggestions. It seems that adding the project id property had no effect. Regarding fs.gs.impl, I can confirm the value is null, so that's probably the cause of the problem, but I am not sure why. I tried setting it even by code: conf.set("fs.gs.impl", com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.class.getName()); but it didn't change a thing. Is there a call in the API I can make to get the Hadoop folder path? Maybe it points to the wrong Hadoop distribution, not the one I set the conf at. – Yaniv Donenfeld Jan 06 '15 at 22:00
  • I think either `core-site.xml` or `conf/core-site.xml` needs to be on the classpath. – Daniel Darabos Jan 06 '15 at 23:00
  • After adding the core-site.xml/hdfs-site.xml to the classpath, now I get the following error upon doing sc = new JavaSparkContext(conf); - java.lang.ClassNotFoundException: org.apache.hadoop.fs.LocalFileSystem. I am getting this, even though I have hadoop-core.jar version 1.2.1 in my classpath. – Yaniv Donenfeld Jan 08 '15 at 09:40
  • In my project that class comes from `hadoop-common-2.2.0.jar`. – Daniel Darabos Jan 08 '15 at 12:59