I have a dataframe df:
id                    email            firstname  lastname  salutation
2be858a0458faa569d3d  [email protected]                   Lastname
2be858a0458faa569d3d  [email protected]  Firstname
2be858a0458faa569d3d  [email protected]  Firstname            Mr
I want to have a consolidated dataframe df_consolidated:
id                    email            firstname  lastname  salutation
2be858a0458faa569d3d  [email protected]  Firstname  Lastname  Mr
The logic should be that it takes all non-empty values from df for the same id and "sums" them up into one row.
Any idea?
CodePudding user response:
You can use groupby followed by first, which returns the first non-null value of each column per group:
>>> df.groupby('id', as_index=False).first()
                     id            email  firstname  lastname salutation
0  2be858a0458faa569d3d  [email protected]  Firstname  Lastname         Mr
If the missing values are empty strings rather than NaN, replace '' with np.nan first, because first() only skips null values:
>>> import numpy as np
>>> df.replace({'': np.nan}).groupby('id', as_index=False).first()
                     id            email  firstname  lastname salutation
0  2be858a0458faa569d3d  [email protected]  Firstname  Lastname         Mr
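For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question, the email address is a made-up placeholder, and it assumes the gaps are stored as empty strings:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question; the email is a placeholder.
df = pd.DataFrame({
    "id": ["2be858a0458faa569d3d"] * 3,
    "email": ["user@example.com"] * 3,
    "firstname": ["", "Firstname", "Firstname"],
    "lastname": ["Lastname", "", ""],
    "salutation": ["", "", "Mr"],
})

# Turn empty strings into NaN so that first() skips them,
# then keep the first non-null value of every column per id.
df_consolidated = (
    df.replace({"": np.nan})
      .groupby("id", as_index=False)
      .first()
)

print(df_consolidated)
```

Note that if one id has several different non-empty values in the same column, only the first one survives, so this fits best when the rows do not contradict each other.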
CodePudding user response:
You need some sort of identifier that defines which rows are considered the "same".
If all rows are identical and you just want one, you can use:
df.drop_duplicates()
or the answer of @Corralien.
If all rows that should be aggregated share a specific trait, e.g. 'id', you can use:
df.groupby(['id']).agg(set)
which returns one row per id, with a set of the unique values in each column. (Use agg rather than apply here: apply(set) passes each group as a whole DataFrame, and iterating a DataFrame yields its column names, not its values.) A set in Python is unordered, so if the order matters you can use the keys of a dictionary as a replacement, see here: Does Python have an ordered set?
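As a rough sketch of the dictionary-keys idea: dict keys keep insertion order in Python 3.7+, so dict.fromkeys can stand in for an ordered set. The helper name ordered_unique and the sample data are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data with repeated values per id.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "firstname": ["Zoe", "Adam", "Zoe", "Eve"],
})

def ordered_unique(values):
    """Return the unique non-null values in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(values.dropna()))

# Each column of each group is passed to ordered_unique as a Series.
result = df.groupby("id").agg(ordered_unique)

print(result)
```

Each cell of result then holds a list instead of a set, so the order in which values first appeared is preserved.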