This is my first project using Kedro with PySpark, and I have an issue. I'm working on the new Mac (M1). When I run spark-shell
in the terminal, Spark is installed correctly and I get the expected output (the welcome banner for Spark version 3.2.1 with the ASCII art). However, when I try to run Spark through my Kedro project, I run into trouble. I searched Stack Overflow discussions but found nothing related to this.
Versions:
- Python: 3.8
- Java: openjdk version "18" 2022-03-22
- PySpark: 3.2.1
Spark conf:
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
And here is my Kedro project context:
import os
from pathlib import Path
from typing import Any, Dict, Union

from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    """A subclass of KedroContext to add Spark initialisation for the pipeline."""

    def __init__(
        self,
        package_name: str,
        project_path: Union[Path, str],
        env: str = None,
        extra_params: Dict[str, Any] = None,
    ):
        super().__init__(package_name, project_path, env, extra_params)
        if not os.getenv("DISABLE_SPARK"):
            self.init_spark_session()

    def init_spark_session(self) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the Spark session
        spark_session_conf = (
            SparkSession.builder.appName(self.package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .master("local[*]")
        )
        _spark_session = spark_session_conf.getOrCreate()
When I run it, I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c60b7e7) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3c60b7e7
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
In my terminal, I exported the following variables, adapted to match my installation paths:
export HOMEBREW_OPT="/opt/homebrew/opt"
export JAVA_HOME="$HOMEBREW_OPT/openjdk/"
export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"
export PATH="$JAVA_HOME:$SPARK_HOME:$PATH"
export SPARK_LOCAL_IP=localhost
Thank you for your help
CodePudding user response:
Hi @Mathilde Roblot, thanks for the detailed report -
The specific error 'cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module' sticks out to me.
Googling suggests that you may be picking up the wrong Java: Spark 3.2 runs on Java 8/11, not the Java 18 listed in your versions.
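If that is the cause, the fix is to run Spark on a supported JDK. A minimal sketch, assuming a Homebrew setup like the one in your exports (openjdk@11 is just one supported choice):

brew install openjdk@11
# Point JAVA_HOME at the supported JDK instead of the latest openjdk
export JAVA_HOME="/opt/homebrew/opt/openjdk@11"
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should now report 11.x, not 18

If you need to stay on Java 17+, this particular IllegalAccessError can usually be worked around by adding --add-exports java.base/sun.nio.ch=ALL-UNNAMED to spark.driver.extraJavaOptions, but since Spark 3.2 is only tested against Java 8 and 11, switching the JDK is the safer route.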
CodePudding user response:
This also happens when your Spark env libs are not being picked up by Kedro, or when Kedro cannot find Spark in your environment.
QQ: are you using an IDE like PyCharm? If so, you might need to go to the preferences and set your environment variables there. I faced the same problem, and setting the env variables in the project preferences fixed it for me.
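To check quickly whether this is what is happening, here is a small diagnostic sketch (the variable names are the ones from your exports) to run from the same interpreter/run configuration the IDE uses:

import os

# Print what the Python process actually sees. If these come back unset,
# the IDE is not inheriting the shell profile where they were exported.
for var in ("JAVA_HOME", "SPARK_HOME", "SPARK_LOCAL_IP"):
    print(var, "=", os.environ.get(var, "<not set>"))

Hope this helps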