Collapsing a DataFrame based on a column

Time:02-21

I have a DataFrame with three columns: 'Bin No', 'Intervals' and 'Other'. The first two hold numerical values and the last holds values of object type. I'm trying to collapse rows whose 'Bin No' value appears more than once, keeping the non-empty values from the other two columns. My current DataFrame looks like this:

   Intervals  Bin No            Other
0        1.5       1                 
1        2.5       2                 
4        3.5       3                 
6        4.5       4                 
8        5.5       5                 
9        8.5       6                 
11      12.5       7                 
2       16.0       8                 
3       22.0       9                 
5       37.0      10                 
12                10  [-99999997, -3]
7       61.0      11                 
10      87.0      12                 
13                14               -2
14                15               -1

In this case, 10 appears twice in 'Bin No', so those two rows should be collapsed into one and the DataFrame should look like:

   Bin No  Intervals            Other
0       1        1.5                 
7       2        2.5                 
8       3        3.5                 
9       4        4.5                 
10      5        5.5                 
11      6        8.5                 
12      7       12.5                 
13      8       16.0                 
14      9       22.0                 
2      10       37.0  [-99999997, -3]
3      11       61.0                 
4      12       87.0                 
5      14                          -2
6      15                          -1

I've managed to find a solution using the following code, but I'm concerned the df.drop_duplicates step may cause issues later on:

df_one = df.groupby('Bin No', as_index=False).agg(''.join)
df_two = pd.merge(df_one, df.drop('Intervals', axis=1), how='left', on='Bin No')
df_three = df_two.drop_duplicates('Bin No', keep='last')

Is there a more effective and less error-prone solution?

CodePudding user response:

You could replace the empty strings with NaN values, then use groupby with first (taking advantage of the fact that first skips NaN). You could map NaN back to empty strings afterwards, though that's arguably redundant:

import numpy as np

out = df.replace('', np.nan).groupby('Bin No', as_index=False)[['Intervals', 'Other']].first()
out['Other'] = out['Other'].fillna('')

Output:

    Bin No  Intervals            Other
0        1        1.5                 
1        2        2.5                 
2        3        3.5                 
3        4        4.5                 
4        5        5.5                 
5        6        8.5                 
6        7       12.5                 
7        8       16.0                 
8        9       22.0                 
9       10       37.0  [-99999997, -3]
10      11       61.0                 
11      12       87.0                 
12      14        NaN               -2
13      15        NaN               -1
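
For completeness, here is a self-contained sketch of the groupby-first approach, using a small reconstruction of the question's data (only a subset of the rows, with bin 10 appearing twice; the values are copied from the printout above):

```python
import numpy as np
import pandas as pd

# Small reconstruction of the question's data: bin 10 appears twice,
# once carrying the interval and once carrying the 'Other' value.
df = pd.DataFrame({
    'Intervals': [1.5, 2.5, 3.5, 37.0, np.nan],
    'Bin No':    [1, 2, 3, 10, 10],
    'Other':     ['', '', '', '', [-99999997, -3]],
})

# Empty strings become NaN so that first() skips them; each bin then
# keeps its first non-missing value per column.
out = (df.replace('', np.nan)
         .groupby('Bin No', as_index=False)[['Intervals', 'Other']]
         .first())
out['Other'] = out['Other'].fillna('')
print(out)
```

The two rows for bin 10 collapse into a single row holding both 37.0 and [-99999997, -3], with no merge or drop_duplicates step needed.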