I have a CSV file consisting of 4 columns A, B, C, D. I would like to:
- find all duplicates that have the same value for columns A, B, C
- collapse each such group into a single row whose column D is the space-separated concatenation of the D values of all duplicates
Example CSV input:
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
Expected result:
John,Yes,123,street 1 street 2
Tom,No,345,street 1 street 2 street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
CodePudding user response:
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Full code:
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""
df = pd.read_csv(StringIO(csv_data))
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Output:
|   | A      | B   | C   | D                           |
|---|--------|-----|-----|-----------------------------|
| 0 | Jason  | Yes | 567 | street 1                    |
| 1 | Jess   | No  | 999 | street 1                    |
| 2 | John   | Yes | 123 | street 1 street 2           |
| 3 | Thomas | No  | 123 | street 1                    |
| 4 | Tom    | No  | 345 | street 1 street 2 street 3 |
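Note that groupby sorts by the group keys by default, which is why the rows above come out in alphabetical order rather than in the input order shown in the expected result. If the goal is a deduplicated CSV file in the original row order, a sketch like the following should work; the output file name output.csv is a placeholder:

```python
import pandas as pd
from io import StringIO

# Same sample data as above; in practice this would come from pd.read_csv("input.csv")
csv_data = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Jason,Yes,567,street 1"""
df = pd.read_csv(StringIO(csv_data))

# sort=False keeps the groups in first-seen order, matching the expected output
merged = df.groupby(['A', 'B', 'C'], sort=False)['D'].agg(' '.join).reset_index()
merged.to_csv('output.csv', index=False, header=False)
```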
Explanation:
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
is the same as
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g.values)).reset_index()
which is the same as these alternatives:
# alternative #1
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g)).reset_index()
# alternative #2
df.groupby(['A','B','C']).apply(lambda g: ' '.join(g['D'])).reset_index()
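For completeness, the same merge can be done without pandas, using only the standard library csv module and a dict keyed on the (A, B, C) tuple. This is a sketch reading from an in-memory string; a real file handle would replace the StringIO:

```python
import csv
from io import StringIO

rows = """John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Jason,Yes,567,street 1"""

# (A, B, C) -> list of D values; dicts preserve insertion order in Python 3.7+
merged = {}
for a, b, c, d in csv.reader(StringIO(rows)):
    merged.setdefault((a, b, c), []).append(d)

out = StringIO()
writer = csv.writer(out)
for (a, b, c), ds in merged.items():
    writer.writerow([a, b, c, ' '.join(ds)])
print(out.getvalue())
```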