I have a DataFrame df with a column full_name that holds a full name (first, middle and last), with the words separated by a single space. I'd like to create a new column containing only the first and middle names.
I have tried the following:
df = df.withColumn('new_name', split(df['full_name'], ' '))
But this returns all the words in a list.
I also tried
df = df.withColumn('new_name', split(df['full_name'], ' ').getItem(1))
But this returns only the second item in the list (the middle name).
Please advise how to proceed with this.
CodePudding user response:
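Try this:
import pyspark.sql.functions as F
# Split the full name on spaces (the column is called full_name in the question)
split_col = F.split(df['full_name'], ' ')
# Join the first two words back together with a single space
df = df.withColumn('FirstMiddle', F.concat_ws(' ', split_col.getItem(0), split_col.getItem(1)))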
df.show()
CodePudding user response:
It took me some time, but I came up with this:
import pyspark.sql.functions as f
df1 = df.withColumn('first_name', f.split(df['full_name'], ' ').getItem(0))\
.withColumn('middle_name', f.split(df['full_name'], ' ').getItem(1))\
.withColumn('New_Name', f.concat(f.col('first_name'), f.lit(' '), f.col('middle_name')))\
.drop('first_name')\
.drop('middle_name')
This code works and the output is as expected, but I am not sure how efficient it is, since it splits the column twice and creates two temporary columns. If anyone has a better idea, please reply.