I'm trying to read an Excel file with Spark using Jupyter in VS Code, with Java version 1.8.0_311 (Oracle Corporation) and Scala version 2.12.15.
Here is the code:
# import necessary libraries
import pandas as pd
from pyspark.sql.types import StructType
# entry point for spark's functionality
from pyspark import SparkContext, SparkConf, SQLContext
configure = SparkConf().setAppName("name").setMaster("local")
sc = SparkContext(conf=configure)
sql = SQLContext(sc)
# entry point for spark's dataframes
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()
# reading excel file
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")
Unfortunately, it produces an error:
Py4JJavaError: An error occurred while calling o36.load.
: java.lang.ClassNotFoundException:
Failed to find data source: com.crealytics.spark.excel. Please find packages at
http://spark.apache.org/third-party-projects.html
CodePudding user response:
Check your classpath: you must have the JAR containing com.crealytics.spark.excel on it.
With Spark, the architecture is a bit different from traditional applications. You may need the JAR in several places: in your application, at the master level, and/or at the worker level. Ingestion (what you're doing) is done by the workers, so make sure they have this JAR on their classpath.
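One way to do this is to let Spark resolve and distribute the package itself via spark.jars.packages. A minimal sketch follows; the spark-excel artifact suffix and version shown are assumptions and should match your Scala build (2.12 in this setup), so check Maven Central for the coordinate you actually need:

from pyspark.sql import SparkSession

# Let Spark download the spark-excel JAR (and its dependencies) from Maven
# and put it on the driver and worker classpaths automatically.
# NOTE: the artifact suffix (_2.12) and the version below are assumptions;
# pick the ones matching your Spark/Scala installation.
spark = SparkSession.builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.7") \
    .getOrCreate()

# Alternatively, point Spark at a JAR you already have on disk
# (hypothetical path) instead of resolving it from Maven:
# .config("spark.jars", "/path/to/spark-excel_2.12-0.13.7.jar")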
CodePudding user response:
Perhaps you should not initialize your SparkContext separately at all. Just create a SparkSession with the config settings, and everything should work fine:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()
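Once that single session exists, the read call from the question can be issued against it (a sketch reusing the path, sheet name, and options from the original code):

# Read the workbook through the spark-excel data source configured above.
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")

df_generika.show(5)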