I wrote a simple Scala application that reads a parquet file from a GCS bucket. The application uses:
- JDK 17
- Scala 2.12.17
- Spark SQL 3.3.1
- gcs-connector hadoop3-2.2.7
The connector is taken from Maven and imported via sbt (the Scala build tool). I'm not using the latest version, 2.2.9, because of this issue.
The application works perfectly in local mode, so I tried to switch to standalone mode. After setting up the cluster, I ran the application again and faced this error:
[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error] at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error] at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error] ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error] ... 24 more
Somehow it cannot find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
My Spark configuration is pretty basic:
spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
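For reference, the same settings can be supplied at submit time instead of in the config file; the sketch below is an assumption about the layout, with placeholder host, jar, and key paths:

```shell
# Hypothetical spark-submit invocation mirroring the configuration above.
# --jars ships the GCS connector to the standalone executors as well as the driver.
spark-submit \
  --master spark://YOUR_SPARK_MASTER_HOST:7077 \
  --jars /opt/jars/gcs-connector-hadoop3-2.2.7.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/gcp_key.json \
  target/scala-2.12/example-app_2.12-0.1.jar
```

With a standalone cluster the connector has to be on the classpath of every executor, not just the driver, which is why it is passed explicitly via --jars here.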
CodePudding user response:
I've found out that the Maven version of the GCS Hadoop connector is missing some dependencies internally.
I fixed it by either:
- downloading the connector from https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage and providing it to the Spark configuration on startup (but, as the site clearly states, this is not recommended for production), or
- providing the missing dependencies for the connector myself.
To resolve the second option, I unpacked the gcs-connector jar file, looked for its pom.xml, copied the dependencies into a new standalone pom.xml file, and downloaded them using the command mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/
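The unpack-and-download flow described above can be sketched as shell commands; CONNECTOR_JAR is an assumed location and will differ per machine:

```shell
# Sketch only: extract the pom.xml bundled inside the connector jar,
# then download its dependencies next to Spark's own jars.
CONNECTOR_JAR="$HOME/jars/gcs-connector-hadoop3-2.2.9.jar"  # assumed path
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
# Jars are zip archives; pull out only the Maven pom they ship with.
unzip -o "$CONNECTOR_JAR" 'META-INF/maven/*/pom.xml'
# After copying its <dependencies> into a standalone pom.xml, download the jars
# directly into Spark's jar directory so driver and executors both pick them up.
mvn -f pom.xml dependency:copy-dependencies -DoutputDirectory="$SPARK_HOME/jars"
```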
Here is the example pom.xml that I created; note that I am using the 2.2.9 version of the connector:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>TMP_PACKAGE_GROUP</groupId>
  <artifactId>TMP_PACKAGE_NAME</artifactId>
  <version>0.0.1</version>
  <name>TMP_PACKAGE_NAME</name>
  <description>jar dependencies of the gcs hadoop connector</description>

  <dependencies>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcs-connector</artifactId>
      <version>hadoop3-2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-jackson2</artifactId>
      <version>2.1.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>31.1-jre</version>
    </dependency>
    <dependency>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client</artifactId>
      <version>1.34.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util</artifactId>
      <version>2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util-hadoop</artifactId>
      <version>hadoop3-2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcsio</artifactId>
      <version>2.2.9</version>
    </dependency>
    <dependency>
      <groupId>com.google.auto.value</groupId>
      <artifactId>auto-value-annotations</artifactId>
      <version>1.10.1</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>flogger</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>google-extensions</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.flogger</groupId>
      <artifactId>flogger-system-backend</artifactId>
      <version>0.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.10</version>
    </dependency>
  </dependencies>
</project>
Hope this helps.
CodePudding user response:
This is caused by the fact that Spark uses an old Guava library version and you used a non-shaded GCS connector jar. To make it work, you just need to use the shaded GCS connector jar from Maven, for example: https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
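A quick way to confirm that a given connector jar actually bundles the class the error complains about is to list its entries; a sketch, assuming curl and unzip are available locally:

```shell
# Fetch the shaded connector jar and check that GoogleHadoopFileSystem is packaged inside it.
curl -fsSLO https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
unzip -l gcs-connector-hadoop3-2.2.9-shaded.jar | grep GoogleHadoopFileSystem
```

If the grep prints nothing, the jar you are shipping cannot satisfy the fs.gs.impl setting and the ClassNotFoundException above is expected.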