Read csv that contains array of string in pyspark


I'm trying to read a csv that has the following data:

name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3

Using inferSchema results in the stops field spilling over into the next columns and messing up the dataframe.
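
For reference, this is roughly what I'm running (file.csv is just a placeholder name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# reading with inferSchema: the quoted stops value gets split across columns
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df.show()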

If I give my own schema like:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, BooleanType, ArrayType,
                                   DoubleType)

    schema = StructType([
        StructField('name', StringType()),
        StructField('date', TimestampType()),
        StructField('win', BooleanType()),
        StructField('stops', ArrayType(StringType())),
        StructField('cost', DoubleType())])

it results in this exception:

pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.

So how would I properly read the CSV without this failure?

CodePudding user response:

Since the CSV data source doesn't support arrays, you need to read that column as a string first, then convert it.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Set the escape option to '"', since it is not the default escape character (\)
df = spark.read.csv('file.csv', header=True, escape='"')

# stops comes back as a plain string; parse it into an array
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
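
To verify the conversion, something like this should work (explode_outer keeps rows such as b, where stops is null):

df.printSchema()  # stops should now show as array<string>
df.select('name', F.explode_outer('stops').alias('stop')).show()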

CodePudding user response:

I guess this is what you are looking for:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

# header row is used for column names; without inferSchema every column is read as a string
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")

dataframe.printSchema()
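
Note that this reads stops as a plain string column. If you need an actual array, you can parse it the same way as in the other answer, e.g.:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# follow-up step: convert the JSON-like string into array<string>
dataframe = dataframe.withColumn('stops', F.from_json('stops', ArrayType(StringType())))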

Let me know if it helps
