Consolidate a dataframe based on conditions

Time:01-11

I have a dataframe df

    id                      email                firstname    lastname    salutation
    2be858a0458faa569d3d    [email protected]                 Lastname
    2be858a0458faa569d3d    [email protected]    Firstname
    2be858a0458faa569d3d    [email protected]    Firstname                Mr

I want to have a consolidated dataframe df_consolidated

    2be858a0458faa569d3d    [email protected]    Firstname   Lastname Mr

The logic should be that it takes all values from df and "sums" them up to one row.

Any idea?

CodePudding user response:

You can use groupby followed by first, which keeps the first non-null value of each column within every group:

>>> df.groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  [email protected]  Firstname  Lastname         Mr

If the missing values are empty strings rather than NaN, replace '' with np.nan first (with numpy imported as np), because first only skips null values:

>>> df.replace({'': np.nan}).groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  [email protected]  Firstname  Lastname         Mr
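
For reference, here is a minimal, self-contained sketch of the approach above. The frame is rebuilt from the question; the address user@example.com is a placeholder (an assumption, since the real address is obscured), and the missing cells are assumed to be empty strings.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question; the email is a placeholder (assumption).
df = pd.DataFrame({
    "id": ["2be858a0458faa569d3d"] * 3,
    "email": ["user@example.com"] * 3,
    "firstname": ["", "Firstname", "Firstname"],
    "lastname": ["Lastname", "", ""],
    "salutation": ["", "", "Mr"],
})

# Turn empty strings into NaN, then keep the first non-null value
# of every column for each id.
df_consolidated = df.replace({"": np.nan}).groupby("id", as_index=False).first()
print(df_consolidated)
```

If two rows hold different non-null values in the same column, first keeps whichever appears earlier, so the result depends on row order.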

CodePudding user response:

You need some kind of identifier that defines which rows count as the "same".

If all rows are identical and you just want one, you can use:

df.drop_duplicates()

or the answer of @Corralien.

If all rows that should be aggregated share a specific trait, e.g. the same 'id', you can use:

df.groupby('id').agg(set)

This returns one row per id, with each remaining column holding the set of values found for that id (agg applies set to every column separately). A Python set is unordered, so if the order matters you can use the keys of a dictionary as a replacement; see the question "Does Python have an ordered set?".
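
The dictionary-keys idea can be sketched as follows. This is only an illustration: the helper name ordered_unique and the sample data are made up, and dropping null values before collecting is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd

def ordered_unique(values):
    """Return the distinct non-null values of a column in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(values.dropna()))

# Hypothetical sample data with conflicting and missing entries.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "firstname": ["Ann", np.nan, "Ann", "Bob"],
    "lastname": [np.nan, "Lee", "Li", "Ray"],
})

# One row per id; each cell is a list of the values seen for that id.
collected = df.groupby("id").agg(ordered_unique)
print(collected)
```

Unlike first, this keeps every distinct value, which makes conflicting entries for the same id visible.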
