Home > database >  How do I groupby one column in a dataframe and combine on another column no matter if there is text
How do I groupby one column in a dataframe and combine on another column no matter if there is text

Time:02-19

I want to groupby on one column of a Dataframe and then combine any text in another column for the unique values of the first column (call_ID). The code should also delete any duplicates. My problem seems to be in deleting the duplicates. I tried the code below but it is not successful.

# Dataframe
dftest0 = pd.DataFrame(data ={'call_ID':[5423684,5423684,5423684,5423684],
'other_comb_words':['','','inspection','inspection']})

# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)

# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(['call_ID'], as_index=False)
['other_comb_words'].transform(lambda x : ' '.join(x))

# remove duplicates
dftest0 = dftest0.drop_duplicates(subset='other_comb_words')  
dftest0

Dataframe sample:

call_ID other_comb_words
5423684 
5423684 
5423684 inspection
5423684 inspection

Current output:

call_ID other_comb_words
5423684 inspection inspection

Desired output:

call_ID other_comb_words
5423684 inspection

CodePudding user response:

One way to do that could be use .agg and .unique:

# Dataframe
...

# Change datatype
...

# groupby and combine text
df = (
    dftest0.groupby('call_ID', as_index=False)['other_comb_words']
           .agg(lambda x: ' '.join(x.unique()))
)

Result:

   call_ID other_comb_words
0  5423684       inspection

CodePudding user response:

Put your line that drop duplicates also above your groupby line but without parameters like this (I added more data as example to confirm the expected result):

import pandas as pd

# Dataframe
dftest0 = pd.DataFrame(
    data ={'call_ID':[5423684,5423684,5423684,5423684,5423684,1234567],
    'other_comb_words':['','','inspection','inspection','example_1','example_2']})

# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)

# remove duplicates
dftest0 = dftest0.drop_duplicates()

# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(
    ['call_ID'], as_index=False)['other_comb_words'].transform(lambda x : ' '.join(x))

# remove duplicates from subsets
dftest0 = dftest0.drop_duplicates()

print(dftest0)

Result:

   call_ID       other_comb_words
0  5423684   inspection example_1
5  1234567              example_2

This is because you need to remove the duplicates from the original dataset, then after the groupby combination

  • Related