I want to groupby
on one column of a Dataframe
and then combine any text in another column for the unique values of the first column (call_ID). The code should also delete any duplicates. My problem seems to be in deleting the duplicates. I tried the code below but it is not successful.
# Dataframe
dftest0 = pd.DataFrame(data ={'call_ID':[5423684,5423684,5423684,5423684],
'other_comb_words':['','','inspection','inspection']})
# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)
# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(['call_ID'], as_index=False)
['other_comb_words'].transform(lambda x : ' '.join(x))
# remove duplicates
dftest0 = dftest0.drop_duplicates(subset='other_comb_words')
dftest0
Dataframe
sample:
call_ID other_comb_words
5423684
5423684
5423684 inspection
5423684 inspection
Current output:
call_ID other_comb_words
5423684 inspection inspection
Desired output:
call_ID other_comb_words
5423684 inspection
CodePudding user response:
One way to do that could be use .agg
and .unique
:
# Dataframe
...
# Change datatype
...
# groupby and combine text
df = (
dftest0.groupby('call_ID', as_index=False)['other_comb_words']
.agg(lambda x: ' '.join(x.unique()))
)
Result:
call_ID other_comb_words
0 5423684 inspection
CodePudding user response:
Put your line that drop duplicates also above your groupby line but without parameters like this (I added more data as example to confirm the expected result):
import pandas as pd
# Dataframe
dftest0 = pd.DataFrame(
data ={'call_ID':[5423684,5423684,5423684,5423684,5423684,1234567],
'other_comb_words':['','','inspection','inspection','example_1','example_2']})
# Change datatype
dftest0['call_ID'] = dftest0['call_ID'].astype(str)
# remove duplicates
dftest0 = dftest0.drop_duplicates()
# groupby and combine text
dftest0['other_comb_words'] = dftest0.groupby(
['call_ID'], as_index=False)['other_comb_words'].transform(lambda x : ' '.join(x))
# remove duplicates from subsets
dftest0 = dftest0.drop_duplicates()
print(dftest0)
Result:
call_ID other_comb_words
0 5423684 inspection example_1
5 1234567 example_2
This is because you need to remove the duplicates from the original dataset, then after the groupby combination