Home > Software engineering >  Python Kedro PySpark : py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.
Python Kedro PySpark : py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.

Time:04-09

it's my first project using kedro with Pyspark and I have an issue. I work with the new Mac (M1). When I do spark-shell in the terminal, spark is successfully installed and I have the right output (welcome to spark version 3.2.1 with the picture). However, I tried to run spark using Kedro project, I have a trouble. I tried to find solutions thanks to stack overflow discussion but nothing linked with this.

Version:

  • Python : 3.8
  • Java : openjdk version "18" 2022-03-22
  • PySpark : 3.2.1

Spark conf :

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

And in my project context of Kedro :

class ProjectContext(KedroContext):
    """A subclass of KedroContext to add Spark initialisation for the pipeline."""

    def __init__(
        self,
        package_name: str,
        project_path: Union[Path, str],
        env: str = None,
        extra_params: Dict[str, Any] = None,
    ):
        super().__init__(package_name, project_path, env, extra_params)
        if not os.getenv('DISABLE_SPARK'):
            self.init_spark_session()

    def init_spark_session(self) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(self.package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .master("local[*]")
        )
        _spark_session = spark_session_conf.getOrCreate()

When I run it, I have this error :

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c60b7e7) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3c60b7e7
    at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
    at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
    at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
    at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
    at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)

In my terminal, I adapted the commands to match my Python path :

export HOMEBREW_OPT="/opt/homebrew/opt"
export JAVA_HOME="$HOMEBREW_OPT/openjdk/"
export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"
export PATH="$JAVA_HOME:$SPARK_HOME:$PATH"
export SPARK_LOCAL_IP=localhost

Thanks you for your help

CodePudding user response:

Hi @Mathilde Roblot thanks for the detailed report -

The specific error 'cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module' sticks out to me.

Googling suggests that you may be retrieving the wrong Java (not 8.0 as required by spark)

CodePudding user response:

this also happens when your spark env libs are not being picked up by Kedro or Kedro is not able to find spark in your env.

QQ : are using an IDE like PyCharm , if that is the case , you might need to go to preferences and embed your env variables. I had faced the same problem and setting the env variables from the project preferences helped me

Hope this helps

  • Related