I have an Excel (.xlsx) file with the following columns:
Location Month Desc Position Budget
EUR 1/1/2020 In Europe Right 34%
AUS 1/1/2020 In Australia Left >22%
While reading this file into a pandas DataFrame and then creating a Spark DataFrame from it, I'm facing an issue with the Budget column and getting the error below:
field Budget: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Could not convert '>22%' with type str: tried to convert to double
I'm trying with this code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

pdf = pandas.read_excel(parent_path + 'file1.xlsx', sheet_name='Sheet1')

fileSchema = StructType([
    StructField("Location", StringType()),
    StructField("Month", DateType()),
    StructField("Desc", StringType()),
    StructField("Position", StringType()),
    StructField("Budget", StringType())])

pdf = pdf.fillna('')
df = spark.createDataFrame(pdf, schema=fileSchema)
df.show()
I need to read multiple Excel files. How do I handle the datatype issue here? Any suggestions?
CodePudding user response:
Looks like you could deal with that using a custom converter:
import pandas as pd

def bcvt(x):
    # strip the '>' qualifier and the '%' sign, then scale to a fraction
    return float(x.replace('>', '').replace('%', '')) / 100

dfd = pd.read_csv(r'd:\jchtempnew\t1.csv', converters={'Budget': bcvt})
dfd
Location Month Desc Position Budget
0 EUR 1/1/2020 In Europe Right 0.34
1 AUS 1/1/2020 In Australia Left 0.22
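The converter helps because Excel typically stores a cell like 34% as the number 0.34 while >22% can only be text, so pandas ends up with an object column mixing floats and strings; Spark's schema inference then sees DoubleType for some rows and StringType for others, which is exactly the merge error in the question. A minimal illustration of that mixed column:

```python
import pandas as pd

# What the raw Budget column can look like to pandas: mixed Python types.
budget = pd.Series([0.34, ">22%"])
print(budget.dtype)                               # object
print(sorted(type(v).__name__ for v in budget))   # ['float', 'str']
```

Running the values through a converter at read time makes every cell a float, so there is only one type left to infer.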
(Updated per @user128029's recommendation)
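Since you need to read multiple Excel files: `read_excel` accepts the same `converters` mapping as `read_csv`, so the approach carries over directly. A sketch, assuming the files sit in one directory matched by a hypothetical glob pattern (the `fileSchema` and `spark` names refer to the question's code):

```python
import glob
import pandas as pd

def bcvt(x):
    # strip the '>' qualifier and the '%' sign, then scale to a fraction
    return float(str(x).replace('>', '').replace('%', '')) / 100

def load_budgets(paths):
    # read_excel supports the same converters= mapping as read_csv
    frames = [pd.read_excel(p, sheet_name='Sheet1', converters={'Budget': bcvt})
              for p in paths]
    return pd.concat(frames, ignore_index=True)

# Hypothetical directory; adjust the pattern to your layout:
# pdf = load_budgets(glob.glob('data/*.xlsx'))
# With Budget now uniformly float, the explicit schema applies cleanly
# (change the Budget field to DoubleType() to match):
# df = spark.createDataFrame(pdf, schema=fileSchema)
```

Because the converter runs on every cell as each file is parsed, every frame arrives with a consistent float Budget column and the Spark merge error disappears.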