I have an Excel (.xlsx) file with the following columns:
Location Month Desc Position Budget
EUR 1/1/2020 In Europe Right 34%
AUS 1/1/2020 In Australia Left >22%
While reading this file into a pandas DataFrame and then creating a Spark DataFrame from it, I'm facing an issue with the Budget column and getting the error below:
field Budget: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Could not convert '>22%' with type str: tried to convert to double
I'm trying with this code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

pdf = pandas.read_excel(parent_path + 'file1.xlsx', sheet_name='Sheet1')

fileSchema = StructType([
    StructField("Location", StringType()),
    StructField("Month", DateType()),
    StructField("Desc", StringType()),
    StructField("Position", StringType()),
    StructField("Budget", StringType())])

pdf = pdf.fillna('')
df = spark.createDataFrame(pdf, schema=fileSchema)
df.show()
I need to read multiple Excel files. How do I handle the datatype issue here? Any suggestions?
CodePudding user response:
Looks like you could deal with that using a custom converter:
import pandas as pd

def bcvt(x):
    # strip the '>' qualifier and the '%' sign, then scale to a fraction
    return float(x.replace('>', '').replace('%', '')) / 100

dfd = pd.read_csv(r'd:\jchtempnew\t1.csv', converters={'Budget': bcvt})
dfd
Location Month Desc Position Budget
0 EUR 1/1/2020 In Europe Right 0.34
1 AUS 1/1/2020 In Australia Left 0.22
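The converter helps because Excel typically stores a cell like 34% as the number 0.34 while >22% can only be text, so pandas ends up with an object column mixing floats and strings; Spark's schema inference then sees DoubleType for some rows and StringType for others, which is exactly the merge error in the question. A minimal illustration of that mixed column:

```python
import pandas as pd

# What the raw Budget column can look like to pandas: mixed Python types.
budget = pd.Series([0.34, ">22%"])
print(budget.dtype)                               # object
print(sorted(type(v).__name__ for v in budget))   # ['float', 'str']
```

Running the values through a converter at read time makes every cell a float, so there is only one type left to infer.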
(Updated per @user128029's recommendation)
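Since you need to read multiple Excel files: `read_excel` accepts the same `converters` mapping as `read_csv`, so the approach carries over directly. A sketch, assuming the files sit in one directory matched by a hypothetical glob pattern (the `fileSchema` and `spark` names refer to the question's code):

```python
import glob
import pandas as pd

def bcvt(x):
    # strip the '>' qualifier and the '%' sign, then scale to a fraction
    return float(str(x).replace('>', '').replace('%', '')) / 100

def load_budgets(paths):
    # read_excel supports the same converters= mapping as read_csv
    frames = [pd.read_excel(p, sheet_name='Sheet1', converters={'Budget': bcvt})
              for p in paths]
    return pd.concat(frames, ignore_index=True)

# Hypothetical directory; adjust the pattern to your layout:
# pdf = load_budgets(glob.glob('data/*.xlsx'))
# With Budget now uniformly float, the explicit schema applies cleanly
# (change the Budget field to DoubleType() to match):
# df = spark.createDataFrame(pdf, schema=fileSchema)
```

Because the converter runs on every cell as each file is parsed, every frame arrives with a consistent float Budget column and the Spark merge error disappears.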