I have a DataFrame like the one below:
import numpy as np
import pandas as pd
df = pd.DataFrame({"Colum1":
["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"],
"Colum2": [24., 25, 21., 33, 26, 24., 25, 21., 33, 26],
"Colum3": [0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.5, 0.6]})
Colum1 Colum2 Colum3
0 A 24.0 0.2
1 A 25.0 0.3
2 A 21.0 0.4
3 A 33.0 0.5
4 B 26.0 0.6
5 B 24.0 0.2
6 B 25.0 0.3
7 C 21.0 0.4
8 C 33.0 0.5
9 D 26.0 0.6
Here I have 4 rows for A, 3 rows for B, 2 for C and 1 for D.
How could I add rows for B, C and D so that each group has the same number of rows as A (4 rows), using the group's mean values for the new rows? For instance, the mean of B for Colum2 is (26 + 24 + 25)/3 = 25 and for Colum3 it is (0.6 + 0.2 + 0.3)/3 ≈ 0.37, so I add one row: B 25 0.37.
For C there are only 2 rows and the means for Colum2 and Colum3 are 27 and 0.45, so I add two rows: C 27 0.45.
For D there is only one row, so I add three rows with the same values.
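(For reference, these per-group means can be checked with a quick groupby:)
df.groupby("Colum1")[["Colum2", "Colum3"]].mean()
#         Colum2    Colum3
# Colum1
# A        25.75  0.350000
# B        25.00  0.366667
# C        27.00  0.450000
# D        26.00  0.600000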
The target is:
Colum1 Colum2 Colum3
0 A 24.0 0.2
1 A 25.0 0.3
2 A 21.0 0.4
3 A 33.0 0.5
4 B 26.0 0.6
5 B 24.0 0.2
6 B 25.0 0.3
7 B 25.0 0.37
8 C 21.0 0.4
9 C 33.0 0.5
10 C 27.0 0.45
11 C 27.0 0.45
12 D 26.0 0.6
13 D 26.0 0.6
14 D 26.0 0.6
15 D 26.0 0.6
Could anyone who knows Pandas well help me?
CodePudding user response:
Solution
g = df.groupby('Colum1')
avg, s = g.mean(), g.size()                    # per-group means and group sizes
rows = avg.loc[avg.index.repeat(s.max() - s)]  # mean rows repeated to pad each group up to the max size
pd.concat([df, rows.reset_index()]).sort_values('Colum1')
Explained
Calculate the mean and size of each group:
>>> avg
Colum2 Colum3
Colum1
A 25.75 0.350000
B 25.00 0.366667
C 27.00 0.450000
D 26.00 0.600000
>>> s
Colum1
A 4
B 3
C 2
D 1
dtype: int64
Repeat the rows of the avg dataframe N times, where N is obtained by subtracting each group's size from the max group size:
>>> rows
Colum2 Colum3
Colum1
B 25.0 0.366667
C 27.0 0.450000
C 27.0 0.450000
D 26.0 0.600000
D 26.0 0.600000
D 26.0 0.600000
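For this data, the repeat counts s.max() - s (how many mean rows each group still needs) are:
>>> s.max() - s
Colum1
A    0
B    1
C    2
D    3
dtype: int64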
Concat the original dataframe with the repeated mean rows to get a dataframe with a balanced group distribution:
Colum1 Colum2 Colum3
0 A 24.0 0.200000
1 A 25.0 0.300000
2 A 21.0 0.400000
3 A 33.0 0.500000
4 B 26.0 0.600000
5 B 24.0 0.200000
6 B 25.0 0.300000
0 B 25.0 0.366667
7 C 21.0 0.400000
8 C 33.0 0.500000
1 C 27.0 0.450000
2 C 27.0 0.450000
9 D 26.0 0.600000
3 D 26.0 0.600000
4 D 26.0 0.600000
5 D 26.0 0.600000
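Note that the appended mean rows keep their own 0, 1, 2, ... index from reset_index(). If a clean 0 to 15 index like the target output is wanted, the result can simply be renumbered at the end:
pd.concat([df, rows.reset_index()]).sort_values('Colum1').reset_index(drop=True)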
CodePudding user response:
I think this is what you're looking for.
df = pd.DataFrame({"Colum1":
["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"],
"Colum2": [24., 25, 21., 33, 26, 24., 25, 21., 33, 26],
"Colum3": [0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.5, 0.6]})
# get max group size
max_val = df.groupby('Colum1').size().max()
# get mean of each group (A, B, C, D)
mean_df = df.groupby('Colum1').mean().reset_index()
new_df = []
for col, grp in df.groupby('Colum1'):
    append_list = [grp]
    # pad short groups with copies of their mean row
    if len(grp) < max_val:
        for i in range(max_val - len(grp)):
            append_list.append(mean_df[mean_df['Colum1'] == col])
    new_df.append(pd.concat(append_list))
output_df = pd.concat(new_df)
output_df
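A quick way to confirm that every group now has max_val rows:
output_df.groupby('Colum1').size()
# Colum1
# A    4
# B    4
# C    4
# D    4
# dtype: int64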
CodePudding user response:
Here is an approach using reshaping: pivot so each group becomes its own set of columns, fill the resulting NaN values with the column means (which are exactly the per-group means), then stack back to long form:
(df
 .assign(idx=df.groupby('Colum1').cumcount())  # row number within each group
 .pivot(index='idx', columns='Colum1')         # reshape so short groups get NaN rows
 .pipe(lambda d: d.fillna(d.mean()))           # fill NaNs with each column's (per-group) mean
 .stack()                                      # back to long form
 .reset_index('Colum1')
 .sort_values(by='Colum1')
)
Output:
Colum1 Colum2 Colum3
idx
0 A 24.0 0.200000
1 A 25.0 0.300000
2 A 21.0 0.400000
3 A 33.0 0.500000
0 B 26.0 0.600000
1 B 24.0 0.200000
2 B 25.0 0.300000
3 B 25.0 0.366667
0 C 21.0 0.400000
1 C 33.0 0.500000
2 C 27.0 0.450000
3 C 27.0 0.450000
0 D 26.0 0.600000
1 D 26.0 0.600000
2 D 26.0 0.600000
3 D 26.0 0.600000