I have a CSV file consisting of 4 columns A, B, C, D. I would like to:
- find all duplicates that have the same value for columns A, B, C
- collapse each such group into a single row whose column D is the space-separated concatenation of the D values of all duplicates
Example CSV input:
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
Expected result:
John,Yes,123,street 1 street 2
Tom,No,345,street 1 street 2 street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
CodePudding user response:
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Full code:
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""
df = pd.read_csv(StringIO(csv_data))
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Output:
|   | A      | B   | C   | D                           |
|---|--------|-----|-----|-----------------------------|
| 0 | Jason  | Yes | 567 | street 1                    |
| 1 | Jess   | No  | 999 | street 1                    |
| 2 | John   | Yes | 123 | street 1 street 2           |
| 3 | Thomas | No  | 123 | street 1                    |
| 4 | Tom    | No  | 345 | street 1 street 2 street 3 |
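Note that groupby sorts by the group keys by default, which is why the rows above come out in alphabetical order rather than in the input order shown in the expected result. If the goal is a deduplicated CSV file in the original row order, a sketch like the following should work; the output file name output.csv is a placeholder:

```python
import pandas as pd
from io import StringIO

# Same sample data as above; in practice this would come from pd.read_csv("input.csv")
csv_data = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Jason,Yes,567,street 1"""
df = pd.read_csv(StringIO(csv_data))

# sort=False keeps the groups in first-seen order, matching the expected output
merged = df.groupby(['A', 'B', 'C'], sort=False)['D'].agg(' '.join).reset_index()
merged.to_csv('output.csv', index=False, header=False)
```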
Explanation:
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
is the same as
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g.values)).reset_index()
which is the same as these alternatives:
# alternative #1
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g)).reset_index()
# alternative #2
df.groupby(['A','B','C']).apply(lambda g: ' '.join(g['D'])).reset_index()
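For completeness, the same merge can be done without pandas, using only the standard library csv module and a dict keyed on the (A, B, C) tuple. This is a sketch reading from an in-memory string; a real file handle would replace the StringIO:

```python
import csv
from io import StringIO

rows = """John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Jason,Yes,567,street 1"""

# (A, B, C) -> list of D values; dicts preserve insertion order in Python 3.7+
merged = {}
for a, b, c, d in csv.reader(StringIO(rows)):
    merged.setdefault((a, b, c), []).append(d)

out = StringIO()
writer = csv.writer(out)
for (a, b, c), ds in merged.items():
    writer.writerow([a, b, c, ' '.join(ds)])
print(out.getvalue())
```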