Home > Back-end >  pandas dataframe duplicate rows with means based on the unique value in one colum and so that each u
pandas dataframe duplicate rows with means based on the unique value in one colum and so that each u

Time:05-29

i have a dataframe like the the one below:

import numpy as np
import pandas as pd
df = pd.DataFrame({"Colum1":

               ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"],

               "Colum2": [24., 25, 21., 33, 26, 24., 25, 21., 33, 26],

               "Colum3": [0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.5, 0.6]})
  Colum1  Colum2  Colum3
0      A    24.0     0.2
1      A    25.0     0.3
2      A    21.0     0.4
3      A    33.0     0.5
4      B    26.0     0.6
5      B    24.0     0.2
6      B    25.0     0.3
7      C    21.0     0.4
8      C    33.0     0.5
9      D    26.0     0.6

here i have 4 rows corresponding to A, 3 rows corresponding B, 2 for C and 1 for D.

How could i add rows for B, C and D to have the same number as A to have 4 rows using their mean values? for instance mean of B for colum2 is (26 25 24)/3 = 25 and for colum3 is (0.6 0.2 0.3)/3 = 0.37 so i add one row of B 25 0.37

for C, there are only 2 rows and means for colum 2 and 3 are 27 and 0.45 so i add two row of C 27 0.45

for D there is only one row, so we add three rows with the same values

the target is:

  Colum1  Colum2  Colum3
0      A    24.0     0.2
1      A    25.0     0.3
2      A    21.0     0.4
3      A    33.0     0.5
4      B    26.0     0.6
5      B    24.0     0.2
6      B    25.0     0.3
7      B    25.0     0.37
8      C    21.0     0.4
9      C    33.0     0.5
10     C    27.0     0.45
11     C    27.0     0.45
12     D    26.0     0.6
13     D    26.0     0.6
14     D    26.0     0.6
15     D    26.0     0.6

any one who knows a lot of Pandas could help me?

CodePudding user response:

Solution

g = df.groupby('Colum1')
avg, s = g.mean(), g.size()
rows = avg.loc[avg.index.repeat(s.max() - s)]

pd.concat([df, rows.reset_index()]).sort_values('Colum1')

Explained

Calculate mean and size of each group

>>> avg
        Colum2    Colum3
Colum1                  
A        25.75  0.350000
B        25.00  0.366667
C        27.00  0.450000
D        26.00  0.600000

>>> s
Colum1
A    4
B    3
C    2
D    1
dtype: int64

Repeat the rows of avg dataframe N number of times where N is obtained by subtracting size of each group from the max group size

>>> rows
        Colum2    Colum3
Colum1                  
B         25.0  0.366667
C         27.0  0.450000
C         27.0  0.450000
D         26.0  0.600000
D         26.0  0.600000
D         26.0  0.600000

Concat the original dataframe with sampled rows to get the dataframe having a balanced group distribution

  Colum1  Colum2    Colum3
0      A    24.0  0.200000
1      A    25.0  0.300000
2      A    21.0  0.400000
3      A    33.0  0.500000
4      B    26.0  0.600000
5      B    24.0  0.200000
6      B    25.0  0.300000
0      B    25.0  0.366667
7      C    21.0  0.400000
8      C    33.0  0.500000
1      C    27.0  0.450000
2      C    27.0  0.450000
9      D    26.0  0.600000
3      D    26.0  0.600000
4      D    26.0  0.600000
5      D    26.0  0.600000

CodePudding user response:

I think this is what you're looking for.

df = pd.DataFrame({"Colum1":

               ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"],

               "Colum2": [24., 25, 21., 33, 26, 24., 25, 21., 33, 26],

               "Colum3": [0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.5, 0.6]})

# get max value 
max_val = df.groupby('Colum1').size().max()
# get mean for each group(A, B, C, D)
mean_df = df.groupby('Colum1').mean().reset_index()

new_df = []

for col, grp in df.groupby('Colum1'):
    append_list = [grp]
    if(len(grp) < max_val):
        for i in range(max_val - len(grp)):
            append_list.append(mean_df[mean_df['Colum1'] == col])
    new_df.append(pd.concat(append_list))
output_df = pd.concat(new_df)
output_df

CodePudding user response:

Here is an approach using reshaping:

(df
 .assign(idx=df.groupby('Colum1').cumcount())
 .pivot('idx', 'Colum1')
 .pipe(lambda d: d.fillna(d.mean()))
 .stack()
 .reset_index('Colum1')
 .sort_values(by='Colum1')
)

Output:

    Colum1  Colum2    Colum3
idx                         
0        A    24.0  0.200000
1        A    25.0  0.300000
2        A    21.0  0.400000
3        A    33.0  0.500000
0        B    26.0  0.600000
1        B    24.0  0.200000
2        B    25.0  0.300000
3        B    25.0  0.366667
0        C    21.0  0.400000
1        C    33.0  0.500000
2        C    27.0  0.450000
3        C    27.0  0.450000
0        D    26.0  0.600000
1        D    26.0  0.600000
2        D    26.0  0.600000
3        D    26.0  0.600000
  • Related