How do I groupby one column in a dataframe and combine on another column no matter if there is text-CodePudding

I want to groupby on one column of a Dataframe and then combine any text in another column for the unique values of the first column (call_ID). The code should also delete any duplicates. My problem seems to be in deleting the duplicates. I tried the code below but it is not successful.

# Dataframe
dftest0 = pd.DataFrame(data ={'call_ID':[5423684,5423684,5423684,5423684],
'other_comb_words':['','','inspection','inspection']})

# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)

# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(['call_ID'], as_index=False)
['other_comb_words'].transform(lambda x : ' '.join(x))

# remove duplicates
dftest0 = dftest0.drop_duplicates(subset='other_comb_words')  
dftest0

Dataframe sample:

call_ID other_comb_words
5423684 
5423684 
5423684 inspection
5423684 inspection

Current output:

call_ID other_comb_words
5423684 inspection inspection

Desired output:

call_ID other_comb_words
5423684 inspection

CodePudding user response：

One way to do that could be use .agg and .unique:

# Dataframe
...

# Change datatype
...

# groupby and combine text
df = (
    dftest0.groupby('call_ID', as_index=False)['other_comb_words']
           .agg(lambda x: ' '.join(x.unique()))
)

Result:

   call_ID other_comb_words
0  5423684       inspection

CodePudding user response：

Put your line that drop duplicates also above your groupby line but without parameters like this (I added more data as example to confirm the expected result):

import pandas as pd

# Dataframe
dftest0 = pd.DataFrame(
    data ={'call_ID':[5423684,5423684,5423684,5423684,5423684,1234567],
    'other_comb_words':['','','inspection','inspection','example_1','example_2']})

# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)

# remove duplicates
dftest0 = dftest0.drop_duplicates()

# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(
    ['call_ID'], as_index=False)['other_comb_words'].transform(lambda x : ' '.join(x))

# remove duplicates from subsets
dftest0 = dftest0.drop_duplicates()

print(dftest0)

Result:

   call_ID       other_comb_words
0  5423684   inspection example_1
5  1234567              example_2

This is because you need to remove the duplicates from the original dataset, then after the groupby combination