Concatenation of multiple columns


I have three columns in my data frame:

CaseID  FirstName  LastName
1       rohit      pandey
2                  rai
3

In the output, I am trying to add a fourth column, ContactName, whose values are LastName, FirstName.

I have this Python code:

df_ids['ContactName'] = df_ids[['LastName', 'FirstName']].agg(lambda x: ','.join(x.values), axis=1)

But it also joins the blank values, so instead I get something like this:

CaseID  FirstName  LastName  ContactName
1       rohit      pandey    pandey, rohit
2                  rai       , rai
3                            ,

The expected output:

CaseID  FirstName  LastName  ContactName
1       rohit      pandey    pandey, rohit
2                  rai       rai
3

CodePudding user response:

This is the easy way, using apply. apply takes each row one at a time and passes it to the given function.

import pandas as pd
data = [
    [ 1, 'rohit', 'pandey' ],
    [ 2, '', 'rai' ],
    [ 3, '', '' ]
]
df = pd.DataFrame(data, columns=['CaseID', 'FirstName', 'LastName'] )

def fixup(row):
    # Empty LastName -> empty contact; empty FirstName -> LastName only.
    if not row['LastName']:
        return ''
    if not row['FirstName']:
        return row['LastName']
    return row['LastName'] + ', ' + row['FirstName']

print(df)
df['Contact1'] = df.apply(fixup, axis=1)
print(df)

Output:

   CaseID FirstName LastName
0       1     rohit   pandey
1       2                rai
2       3                   
   CaseID FirstName LastName       Contact1
0       1     rohit   pandey  pandey, rohit
1       2                rai            rai
2       3                                  
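A side note, not part of the answer above: if the name columns arrive as NaN instead of '' (for example when the frame is read from a CSV with missing cells), the truthiness checks in fixup won't treat NaN as empty and the final concatenation will raise a TypeError. A minimal sketch, reusing the fixup defined above with made-up data, that normalizes missing values first:

import pandas as pd

df = pd.DataFrame(
    [[1, 'rohit', 'pandey'], [2, None, 'rai'], [3, None, None]],
    columns=['CaseID', 'FirstName', 'LastName'],
)

# Normalize missing names to empty strings so fixup's truthiness checks apply.
df[['FirstName', 'LastName']] = df[['FirstName', 'LastName']].fillna('')

df['Contact1'] = df.apply(fixup, axis=1)  # fixup as defined above
print(df)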

CodePudding user response:

Two (actually 1 and a half) other options, which are very close to your attempt:

df_ids['ContactName'] = (
    df_ids[['LastName', 'FirstName']]
    .agg(lambda row: ', '.join(name for name in row if name), axis=1)
)

or

df_ids['ContactName'] = (
    df_ids[['LastName', 'FirstName']]
    .agg(lambda row: ', '.join(filter(None, row)), axis=1)
)

In both versions the empty strings ('') are filtered out (a short runnable check follows the list):

  1. Via a generator expression: the if name condition skips '', because its truth value is False - try print(bool('')).
  2. By the built-in function filter() with the first argument set to None.
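A quick, self-contained check of both mechanisms on a made-up row:

row = ['rai', '']  # LastName present, FirstName empty

print(bool(''))                                  # False
print(', '.join(name for name in row if name))   # rai
print(', '.join(filter(None, row)))              # rai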

CodePudding user response:

Someone has added the PySpark tag. This is the PySpark version:

from pyspark.sql import functions as F

df_ids = df_ids.replace('', None)  # Replaces empty strings with nulls
df_ids = df_ids.withColumn('ContactName', F.concat_ws(', ', 'LastName', 'FirstName'))
df_ids = df_ids.fillna('')  # Replaces nulls back to empty strings

df_ids.show()
# +------+---------+--------+-------------+
# |CaseID|FirstName|LastName|  ContactName|
# +------+---------+--------+-------------+
# |     1|    rohit|  pandey|pandey, rohit|
# |     2|         |     rai|          rai|
# |     3|         |        |             |
# +------+---------+--------+-------------+
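
For completeness, a self-contained sketch of the above; the SparkSession setup and the literal data are assumptions, not part of the original snippet:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed input data, mirroring the pandas example above.
df_ids = spark.createDataFrame(
    [(1, 'rohit', 'pandey'), (2, '', 'rai'), (3, '', '')],
    ['CaseID', 'FirstName', 'LastName'],
)

df_ids = df_ids.replace('', None)  # empty strings -> nulls, which concat_ws skips
df_ids = df_ids.withColumn('ContactName', F.concat_ws(', ', 'LastName', 'FirstName'))
df_ids = df_ids.fillna('')         # nulls back to empty strings
df_ids.show()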