Python Pandas: extract duplicates from a CSV and aggregate the values of a column

Time:02-15

I have a CSV file consisting of 4 columns: A, B, C, D. I would like to:

  • find all duplicate rows that share the same values in columns A, B, C
  • collapse each such group into a single row, where column D is the concatenation of the D values of all the duplicates

Example CSV input:

John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1

Expected result:

John,Yes,123,street 1 street 2
Tom,No,345,street 1 street 2 street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1

CodePudding user response:

df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()

Full code:

import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""
df = pd.read_csv(StringIO(csv_data))
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()

Output:

        A    B    C                            D
0   Jason  Yes  567                     street 1
1    Jess   No  999                     street 1
2    John  Yes  123            street 1 street 2
3  Thomas   No  123                     street 1
4     Tom   No  345  street 1 street 2 street 3

Explanation:

df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index() works because each group is passed to ' '.join as a Series of strings. It is equivalent to df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g.values)).reset_index(), and to these alternatives:

# alternative #1
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g)).reset_index()
# alternative #2
df.groupby(['A','B','C']).apply(lambda g: ' '.join(g['D'])).reset_index()
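If you are starting from the headerless CSV file shown in the question, a minimal end-to-end sketch might look like the following. Assumptions: the column names A–D are assigned on read (the file itself has none), `sort=False` is used so the groups keep their first-seen order (matching the expected output above), and `as_index=False` replaces the `reset_index()` step.

```python
import pandas as pd
from io import StringIO

# Stand-in for the question's headerless CSV file.
raw = """John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""

# No header row in the file, so assign column names explicitly.
df = pd.read_csv(StringIO(raw), header=None, names=['A', 'B', 'C', 'D'])

# as_index=False keeps A, B, C as regular columns (no reset_index needed);
# sort=False preserves the order in which each (A, B, C) group first appears.
merged = df.groupby(['A', 'B', 'C'], as_index=False, sort=False)['D'].agg(' '.join)

# Write the result back without header or index to mirror the input format.
print(merged.to_csv(header=False, index=False), end='')
```

With `StringIO` swapped for a real file path, this prints exactly the five lines of the expected result in the question.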