Reading Excel file Using PySpark: Failed to find data source: com.crealytics.spark.excel


I'm trying to read an Excel file with Spark from a Jupyter notebook in VS Code, with Java version 1.8.0_311 (Oracle Corporation) and Scala version 2.12.15.

Here is the code:

# import necessary libraries
import pandas as pd
from pyspark.sql.types import StructType

# entry point for Spark's functionality
from pyspark import SparkContext, SparkConf, SQLContext

configure = SparkConf().setAppName("name").setMaster("local")
sc = SparkContext(conf=configure)
sql = SQLContext(sc)

# entry point for spark's dataframes
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

# reading excel file 
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")

Unfortunately, it produces an error:

Py4JJavaError: An error occurred while calling o36.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: com.crealytics.spark.excel. Please find packages at
http://spark.apache.org/third-party-projects.html

CodePudding user response:

Check your classpath: you must have the JAR containing com.crealytics.spark.excel on it.

With Spark, the architecture is a bit different from that of traditional applications. You may need the JAR in several places: in your application, at the master level, and/or at the worker level. Ingestion (what you're doing) is performed by the workers, so make sure they have this JAR on their classpath.
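One way to make that explicit (a minimal sketch; the JAR path below is a placeholder for wherever your spark-excel JAR actually lives) is to hand the JAR to the session yourself, since spark.jars ships the listed files to the driver and to every executor:

from pyspark.sql import SparkSession

# spark.jars distributes the listed JARs to the driver and all executors;
# replace the path with the location of your spark-excel JAR
spark = SparkSession.builder \
    .master("local") \
    .appName("classpath check") \
    .config("spark.jars", "/path/to/spark-excel_2.12-0.13.5.jar") \
    .getOrCreate()

# confirm the setting actually reached the context
print(spark.sparkContext.getConf().get("spark.jars"))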

CodePudding user response:

Perhaps you should not initialize your SparkContext separately at all. Because a SparkContext already exists by the time SparkSession.builder.getOrCreate() runs, the builder reuses it, and the spark.jars.packages setting has no effect, so the spark-excel package is never pulled in. Just create a single SparkSession with the config settings and everything should be fine. Note also that the artifact's Scala suffix has to match your Scala build: with Scala 2.12.15 you need spark-excel_2.12, not spark-excel_2.11.

from pyspark.sql import SparkSession
# the package's Scala suffix must match your Scala build (2.12 for Scala 2.12.15)
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
    .getOrCreate()
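With that session in place, the original read should go through. A minimal sketch, assuming the 0.13.x option names (the header option replaced useHeader in newer releases) and a data address anchored at cell A1 of Sheet1:

df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sheet1'!A1") \
    .load("./../data/raw-data/generika.xlsx")

df_generika.show(5)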