Home > Back-end >  Combine duplicate rows into one row in Pandas data frame
Combine duplicate rows into one row in Pandas data frame

Time:09-22

Let's say I have the following data: one person can have multiple Constituency Code

Unique_ID Name Constituency Code
404 Mark Teacher
404 Mark Staff
404 Mark Staff
659 Julio Student
1025 Jasmine Staff
1025 Jasmine Student

Question: Using python and pandas, how would I combine them to this: List of unique id and name with all the constituency code combined. Notice that if there are duplicates in constituency code, but it combines it to 1 list of unique constituency code per Unique_ID

Unique_ID Name Constituency Code
404 Mark Teacher, Staff
659 Julio Student
1025 Jasmine Staff, Student

I have tried to use groupby and aggregate attributes on pandas, but I am not able to get the result I want. Any suggestions. I know I can do it by iterating through each row, but I rather not do that.

CodePudding user response:

Setting up data - very important to provide working examples

test_data = [
    [404, 'Mark', 'Teacher'],
    [404, 'Mark', 'Staff'],
    [404, 'Mark', 'Staff'],
    [659, 'Julio', 'Student'],
    [1025, 'Jasmine', 'Staff'],
    [1025, 'Jasmine', 'Student']
]
cols = ['Unique_ID', 'Name', 'Constinuency Code']

df = pd.DataFrame(test_data, columns=cols)

df.groupby(['Unique_ID', 'Name'])['Constinuency Code'].apply(lambda grp: list(set(grp))).reset_index()

Out

   Unique_ID     Name Constinuency Code
0        404     Mark  [Teacher, Staff]
1        659    Julio         [Student]
2       1025  Jasmine  [Student, Staff]

If you need a string instead:

df.groupby(['Unique_ID', 'Name'])['Constinuency Code'].apply(lambda grp: ', '.join(set(grp))).reset_index()

Out

   Unique_ID     Name Constinuency Code
0        404     Mark    Teacher, Staff
1        659    Julio           Student
2       1025  Jasmine    Student, Staff
  • Related