Cannot instantiate GoogleHadoopFileSystem in pyspark?


The same code works completely fine on Linux (Ubuntu) with the same jar files. My Spark is 3.1.2 and Hadoop is 3.2. I've tried every GCS connector version from Maven.

# df is a Spark DataFrame (a comment after a line-continuation backslash
# is a syntax error, so it has to go on its own line)
val = df.write.format('bigquery') \
            .mode(mode) \
            .option("credentialsFile", "creds.json") \
            .option('table', table) \
            .option("temporaryGcsBucket", bucket) \
            .save()

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/17 07:41:50 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
21/09/17 07:41:50 WARN FileSystem: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;J)V
21/09/17 07:41:50 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem could not be instantiated
21/09/17 07:41:50 WARN FileSystem: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
Traceback (most recent call last):
  File "c:\sparktest\main.py", line 158, in <module>
    val = df.write.format('bigquery') \
  File "c:\sparktest\vnenv\lib\site-packages\pyspark\sql\readwriter.py", line 828, in save
    self._jwrite.save()
  File "c:\sparktest\vnenv\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "c:\sparktest\vnenv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
    return f(*a, **kw)
  File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o50.save.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

CodePudding user response:

I forgot to add the connector packages in my Spark config.

 spark = SparkSession \
        .builder \
        .appName(appName) \
        .config(conf=spark_conf) \
        .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
        .getOrCreate()
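For completeness, a fuller builder sketch that also registers the GCS filesystem implementation and the service-account key file explicitly. The app name and key-file path are placeholders, and the package version is just the one from the answer above, not the only working combination; the shaded `spark-bigquery-with-dependencies` artifact bundles its own Guava, which is what avoids the `NoSuchMethodError` on `com.google.common.base.Preconditions` shown in the log:

```python
from pyspark.sql import SparkSession

# Sketch only: "bq-write" and "creds.json" are assumed placeholder names.
spark = (
    SparkSession.builder
    .appName("bq-write")
    # Shaded BigQuery connector (bundles Guava and the GCS connector deps):
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0",
    )
    # Register the gs:// filesystem implementation with Hadoop:
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # Authenticate the GCS connector with a service-account key file:
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "creds.json")
    .getOrCreate()
)
```

Keeping the Guava versions consistent matters here: mixing an unshaded `gcs-connector` jar with Hadoop's older bundled Guava is a common cause of exactly this `Preconditions.checkState` error.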