Pyspark Dataframe - How to create new column with only first 2 words

Time:01-06

I have a DataFrame df with a column holding the full name (first, middle and last). The column is called full_name and the words are separated by a space (the delimiter). I'd like to create a new column containing only the first and middle names.

I have tried the following

df = df.withColumn('new_name', split(df['full_name'], ' '))

But this returns all the words in a list.

I also tried

df = df.withColumn('new_name', split(df['full_name'], ' ').getItem(1))

But this returns only the 2nd name in the list (middle name)

Please advise how to proceed with this.

CodePudding user response:

Try this

import pyspark.sql.functions as F

# Split the full name on spaces, then join the first two words back together
split_col = F.split(df['full_name'], ' ')
df = df.withColumn('FirstMiddle', F.concat_ws(' ', split_col.getItem(0), split_col.getItem(1)))
df.show()

CodePudding user response:

It took me some time, but I came up with this

import pyspark.sql.functions as f

# Build temporary first/middle columns, concatenate them, then drop the temporaries
df1 = df.withColumn('first_name', f.split(df['full_name'], ' ').getItem(0))\
        .withColumn('middle_name', f.split(df['full_name'], ' ').getItem(1))\
        .withColumn('New_Name', f.concat(f.col('first_name'), f.lit(' '), f.col('middle_name')))\
        .drop('first_name')\
        .drop('middle_name')

This code works and the output is as expected, but I am not sure how efficient it is. If anyone has a better idea, please reply.
