PySpark - data mismatch error when trying to split a column content


I'm trying to use PySpark's split() method on a column that has data formatted like:

[6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider]

My intent is to extract the 4th element, i.e. the value after the last comma.

I'm using a syntax like:

mydf.select("primary_component").withColumn("primary_component_01",f.split(mydf.primary_component, "\,").getItem(0)).limit(10).show(truncate=False)

But I'm consistently getting this error:

"cannot resolve 'split(mydf.primary_component, ',')' due to data type mismatch: argument 1 requires string type, however, 'mydf.primary_component' is of structuuid:string,id:int,project:string,component:string type.;;\n'Project [primary_component#17, split(split(primary_component#17, ,)[1], \,)...

I've also tried escaping the "," with \ or \\, and not escaping it at all, but that makes no difference. Removing the ".getItem(0)" makes no difference either.

What am I doing wrong? I feel like a dumbass, but I don't know how to fix this... Thank you for any suggestions.

CodePudding user response:

You are getting the error:

"cannot resolve 'split(mydf.`primary_component`, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.`primary_component`' is of
struct<uuid:string,id:int,project:string,component:string>

because your column primary_component is a struct type, while split expects a string column.
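For comparison, split and getItem do behave the way you expected when the column really is a single string. A minimal sketch (the sample value is made up to mirror your data, and trim is only there to drop the space after each comma):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    spark = SparkSession.builder.getOrCreate()

    # A plain string column containing comma-separated values
    strdf = spark.createDataFrame(
        [("6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider",)],
        ["primary_component"],
    )

    # split returns an array; getItem(3) picks the 4th element
    strdf.withColumn(
        "primary_component_01",
        f.trim(f.split("primary_component", ",").getItem(3)),
    ).show(truncate=False)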

Since primary_component is already a struct and you are interested in the value after the last comma, you can access that field directly using dot notation:

mydf.withColumn("primary_component_01","primary_component.component")

In the error message, Spark reports the schema of your struct as

struct<uuid:string,id:int,project:string,component:string>

i.e.

column      data type
uuid        string
id          int
project     string
component   string

For future debugging, you can use mydf.printSchema() to show the schema of the Spark DataFrame you are working with.
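
For this struct the output would look roughly like the following (the nullable flags depend on how the data was loaded):

    mydf.printSchema()
    # root
    #  |-- primary_component: struct (nullable = true)
    #  |    |-- uuid: string (nullable = true)
    #  |    |-- id: integer (nullable = true)
    #  |    |-- project: string (nullable = true)
    #  |    |-- component: string (nullable = true)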
