I have an unusual dataset that I do not know how to preprocess. Here is an example:
Year, ID, feature1, feature2, target1
2008, 1, 10, 20, 5
2008, 1, 12, 25, 6
2008, 1, NaN, NaN, 4
2008, 1, NaN, NaN, 7
2008, 1, NaN, NaN, 3
2008, 1, NaN, NaN, 5
2008, 2, 22, 16, 7
2008, 2, 24, 14, 3
2008, 2, NaN, NaN, 5
2008, 2, NaN, NaN, 6
2008, 2, NaN, NaN, 9
2008, 3, 12, 15, 6
2008, 3, NaN, NaN, 1
....
Within each (Year, ID) group, I would like to replace the first two entries that have values with their average, and also fill the NaN rows with that same average, for the columns feature1 and feature2. If a group has only one row with a value, as for ID == 3, I will just ffill.
Example output:
Year, ID, feature1, feature2, target1
2008, 1, 11, 22.5, 5
2008, 1, 11, 22.5, 6
2008, 1, 11, 22.5, 4
2008, 1, 11, 22.5, 7
2008, 1, 11, 22.5, 3
2008, 1, 11, 22.5, 5
2008, 2, 23, 15, 7
2008, 2, 23, 15, 3
2008, 2, 23, 15, 5
2008, 2, 23, 15, 6
2008, 2, 23, 15, 9
2008, 3, 12, 15, 6
2008, 3, 12, 15, 1
....
Is there a way I can do that?
CodePudding user response:
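Try groupby with transform('mean'). The mean skips NaN, so each group's rows all receive the average of that group's non-NaN values: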
g = df.groupby(['Year','ID'])
df['feature1'] = g['feature1'].transform('mean')
df['feature2'] = g['feature2'].transform('mean')
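For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample rows from the question are loaded into a DataFrame named df (the rows elided by "...." are not reproduced), and it writes the result to a copy named out, which is a name introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the elided rows are left out.
df = pd.DataFrame({
    "Year": [2008] * 13,
    "ID": [1] * 6 + [2] * 5 + [3] * 2,
    "feature1": [10, 12, np.nan, np.nan, np.nan, np.nan,
                 22, 24, np.nan, np.nan, np.nan,
                 12, np.nan],
    "feature2": [20, 25, np.nan, np.nan, np.nan, np.nan,
                 16, 14, np.nan, np.nan, np.nan,
                 15, np.nan],
    "target1": [5, 6, 4, 7, 3, 5, 7, 3, 5, 6, 9, 6, 1],
})

# Work on a copy so the original frame stays available for comparison.
out = df.copy()
g = df.groupby(["Year", "ID"])
for col in ["feature1", "feature2"]:
    # 'mean' skips NaN, so the group mean uses only the rows that have values;
    # transform broadcasts that mean back to every row of the group.
    out[col] = g[col].transform("mean")

print(out)
```

A group with a single value, such as ID == 3, gets that value in every row, which matches what ffill would produce for this data.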
CodePudding user response:
Use groupby(...).transform('mean') together with df.update to overwrite the values of the feature1 and feature2 columns in place:
# Select the columns with a list (double brackets); tuple-style selection fails on pandas 2.x
df.update(df.groupby(['Year', 'ID'])[['feature1', 'feature2']].transform('mean'))
print(df)
# Output:
    Year  ID  feature1  feature2  target1
0   2008   1      11.0      22.5        5
1   2008   1      11.0      22.5        6
2   2008   1      11.0      22.5        4
3   2008   1      11.0      22.5        7
4   2008   1      11.0      22.5        3
5   2008   1      11.0      22.5        5
6   2008   2      23.0      15.0        7
7   2008   2      23.0      15.0        3
8   2008   2      23.0      15.0        5
9   2008   2      23.0      15.0        6
10  2008   2      23.0      15.0        9
11  2008   3      12.0      15.0        6
12  2008   3      12.0      15.0        1
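Note that the group mean equals "the average of the first two values" only because no group in the sample has more than two non-NaN entries. If some groups could have more, one possible sketch is to average only the first two non-NaN values per group. The helper first_two_mean and the small frame below are hypothetical, made up for illustration, and are not part of the question's data.

```python
import numpy as np
import pandas as pd

def first_two_mean(s: pd.Series) -> float:
    # Average of the first two non-NaN values in the group;
    # with a single value this is just that value, with none it is NaN.
    return s.dropna().head(2).mean()

# Hypothetical data: ID 1 has three values, so a plain group mean would differ.
df = pd.DataFrame({
    "Year": [2008] * 6,
    "ID": [1, 1, 1, 1, 2, 2],
    "feature1": [10, 12, 50, np.nan, 12, np.nan],
})

# transform broadcasts the scalar returned per group to every row of that group.
df["feature1"] = df.groupby(["Year", "ID"])["feature1"].transform(first_two_mean)
print(df)
```

Passing a Python function to transform is usually slower than the built-in 'mean', so this variant is probably only worth it when groups really can hold more than two values.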