I ran this code and got an error:
import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.0")
    .enableHiveSupport()
    .getOrCreate()
)

df_spark_temp = spark.read.format('com.github.saurfang.sas.spark').load('18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
print(df_spark_temp.limit(5).toPandas())
py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.github.saurfang.sas.spark.SasRelation.<init>(SasRelation.scala:48)
at com.github.saurfang.sas.spark.SasRelation$.apply(SasRelation.scala:42)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:50)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:39)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 23 more
Python version: 3.9.6
Java version: 17.0.4.1
PySpark version: 3.3.0
I searched for the same issue on Stack Overflow, and most answers said it may be because of the Scala version.
I have never installed Scala before. Do I need to install Scala, or can I change a setting in Java?
When I run pyspark --version, it shows:
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 17.0.4.1
Does this mean I need to install Scala 2.12.15, or is it already installed?
CodePudding user response:
All libraries must be compiled for the same Scala version you are running with.
I'm not familiar with PySpark, but I can see that at least spark-sas7bdat:2.0.0-s_2.11 seems to be compiled for Scala 2.11, given its version number.
Since you're running Scala 2.12, look for spark-sas7bdat:3.0.0-s_2.12 instead (see the sketch below).
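A minimal sketch of the corrected session setup, assuming the Scala 2.12 build is published under the same saurfang group ID on the spark-packages repository (only the package coordinate changes from your original code):

from pyspark.sql import SparkSession

# Same builder as before; only the spark-sas7bdat coordinate changes
# from 2.0.0-s_2.11 to the Scala 2.12 build.
spark = (
    SparkSession.builder
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:3.0.0-s_2.12,org.apache.hadoop:hadoop-aws:2.7.0")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.format('com.github.saurfang.sas.spark').load('18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
df.limit(5).show()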
Personal note: this library does not appear to be maintained anymore; consider using another approach if this is for production code.
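For example, since pandas can parse sas7bdat files natively, one sketch of a workaround that avoids the unmaintained package entirely (assuming the file fits in driver memory; not a drop-in replacement for large data) is:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# pandas reads the SAS file on the driver, so this only works when
# the whole file fits in the driver's memory.
pdf = pd.read_sas('18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat', format='sas7bdat')
df = spark.createDataFrame(pdf)
df.limit(5).show()

For files too large to load at once, pandas.read_sas also accepts a chunksize argument, which returns an iterator of chunks you could convert and union one at a time.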