split row value by separator and create new columns

Time:02-02

I have a PySpark dataset with a column “channels” that looks like this:

channels

name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5

I want to create 5 new columns, i.e. “channel1, channel2, channel3, channel4, channel5”.

Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.

For example for the first row, the columns should look like this:

channel1   channel2  channel3   channel4   channel5
name1       name2      name3      name4        ~

When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so I used ~ for the channel5 column.

I only want to use ~, not None or NULL.

How can I achieve this result in PySpark?

I tried this:

    df = df.withColumn("channels_split", split(df["channels"], ","))
    df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
    df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
    df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
    df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
    df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
    df = df.drop("channels_split") 

but it gives me this error:

`~` is missing

You're referencing the column `~`, but it is missing from the schema. Please check your code.

Note that I am using PySpark within Foundry.

CodePudding user response:

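coalesce expects columns as arguments, and a plain string such as "~" is interpreted as a column name, which is why Spark reports that the column `~` is missing from the schema. I think you should use lit("~") instead, so that the value is treated as a literal.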

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
