I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. “channel1, channel2, channel3, channel4, channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example, for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so I used ~ for the channel5 column.
I only want to use ~, not None or NULL.
How can I achieve this result in PySpark?
I tried this:
from pyspark.sql.functions import split, coalesce
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using PySpark within Foundry.
CodePudding user response:
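coalesce expects Column arguments, but you are passing a plain string, which PySpark interprets as a column name. That is why it goes looking for a column called `~`. I think you should wrap the value in lit, i.e. use lit("~") (imported from pyspark.sql.functions) as the second argument.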