I'm trying to use PySpark's split()
method on a column that has data formatted like:
[6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider]
My intent is to extract the 4th element, i.e. the value after the last comma. I'm using syntax like:
mydf.select("primary_component").withColumn("primary_component_01",f.split(mydf.primary_component, "\,").getItem(0)).limit(10).show(truncate=False)
But I'm consistently getting this error:
"cannot resolve 'split(mydf.`primary_component`, ',')' due to data type mismatch: argument 1 requires string type, however, 'mydf.`primary_component`' is of struct<uuid:string,id:int,project:string,component:string> type.;;\n'Project [primary_component#17, split(split(primary_component#17, ,)[1], \,)...
I've also tried escaping the "," with \ or \\, and not escaping it at all; none of this makes any difference. Removing the ".getItem(0)" also changes nothing.
What am I doing wrong? I feel like a dumbass, but I don't know how to fix this. Thank you for any suggestions.
CodePudding user response:
You are getting the error:

"cannot resolve 'split(mydf.`primary_component`, ',')' due to data type mismatch: argument 1 requires string type, however, 'mydf.`primary_component`' is of struct<uuid:string,id:int,project:string,component:string>"

because your column primary_component is a struct type, while split expects string columns.
Since primary_component is already a struct, the data is not really comma-separated text at all, and you can reach the value after the last comma (the component field) directly with dot notation. Note that withColumn expects a Column as its second argument, not a string, so wrap the path in f.col:

mydf.withColumn("primary_component_01", f.col("primary_component.component"))
In the error message, Spark has shared the schema of your struct as struct<uuid:string,id:int,project:string,component:string>, i.e.

| column | data type |
| --- | --- |
| uuid | string |
| id | int |
| project | string |
| component | string |
For future debugging, you can call mydf.printSchema() to show the schema of the Spark dataframe in use, which would have revealed the struct type up front.