Combine duplicate rows into one row in Pandas data frame-CodePudding

Let's say I have the following data: one person can have multiple Constituency Code

Unique_ID	Name	Constituency Code
404	Mark	Teacher
404	Mark	Staff
404	Mark	Staff
659	Julio	Student
1025	Jasmine	Staff
1025	Jasmine	Student

Question: Using python and pandas, how would I combine them to this: List of unique id and name with all the constituency code combined. Notice that if there are duplicates in constituency code, but it combines it to 1 list of unique constituency code per Unique_ID

Unique_ID	Name	Constituency Code
404	Mark	Teacher, Staff
659	Julio	Student
1025	Jasmine	Staff, Student

I have tried to use groupby and aggregate attributes on pandas, but I am not able to get the result I want. Any suggestions. I know I can do it by iterating through each row, but I rather not do that.

CodePudding user response：

Setting up data - very important to provide working examples

test_data = [
    [404, 'Mark', 'Teacher'],
    [404, 'Mark', 'Staff'],
    [404, 'Mark', 'Staff'],
    [659, 'Julio', 'Student'],
    [1025, 'Jasmine', 'Staff'],
    [1025, 'Jasmine', 'Student']
]
cols = ['Unique_ID', 'Name', 'Constinuency Code']

df = pd.DataFrame(test_data, columns=cols)

df.groupby(['Unique_ID', 'Name'])['Constinuency Code'].apply(lambda grp: list(set(grp))).reset_index()

Out

   Unique_ID     Name Constinuency Code
0        404     Mark  [Teacher, Staff]
1        659    Julio         [Student]
2       1025  Jasmine  [Student, Staff]

If you need a string instead:

df.groupby(['Unique_ID', 'Name'])['Constinuency Code'].apply(lambda grp: ', '.join(set(grp))).reset_index()

Out

   Unique_ID     Name Constinuency Code
0        404     Mark    Teacher, Staff
1        659    Julio           Student
2       1025  Jasmine    Student, Staff