Pandas: replacing the entries and NaNs with the average of the first two entries

I have an unusual dataset that I do not know how to preprocess. Here is an example:

Year, ID, feature1, feature2, target1
2008, 1,  10,       20,       5
2008, 1,  12,       25,       6
2008, 1,  NaN,      NaN,      4
2008, 1,  NaN,      NaN,      7
2008, 1,  NaN,      NaN,      3
2008, 1,  NaN,      NaN,      5
2008, 2,  22,       16,       7
2008, 2,  24,       14,       3
2008, 2,  NaN,      NaN,      5
2008, 2,  NaN,      NaN,      6
2008, 2,  NaN,      NaN,      9
2008, 3,  12,       15,       6
2008, 3,  NaN,      NaN,      1
....

For columns feature1 and feature2, I would like to replace the first two entries that have values with their average, and also fill the NaNs with that same average, within each Year/ID group. If a group has only one row with a value, like ID == 3, I will just ffill.

Example output:

Year, ID, feature1, feature2, target1
2008, 1,  11,       22.5,     5
2008, 1,  11,       22.5,     6
2008, 1,  11,       22.5,     4
2008, 1,  11,       22.5,     7
2008, 1,  11,       22.5,     3
2008, 1,  11,       22.5,     5
2008, 2,  23,       15,       7
2008, 2,  23,       15,       3
2008, 2,  23,       15,       5
2008, 2,  23,       15,       6
2008, 2,  23,       15,       9
2008, 3,  12,       15,       6
2008, 3,  12,       15,       1
....

Is there a way I can do that?

CodePudding user response:

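Try groupby with transform('mean'). The mean skips NaN by default, so each group's mean is the average of its filled entries, and transform broadcasts it back to every row of the group: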

g = df.groupby(['Year','ID'])
df['feature1'] = g['feature1'].transform('mean')
df['feature2'] = g['feature2'].transform('mean')
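
For anyone who wants to run this end to end, here is a minimal, self-contained sketch. It rebuilds the sample rows shown in the question (the rows elided by "...." are not included) and assumes the missing entries are real NaN values rather than the string "NaN". The `cols` variable is just a convenience name, not something from the original post.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (only the rows that were shown).
df = pd.DataFrame({
    "Year": [2008] * 13,
    "ID": [1] * 6 + [2] * 5 + [3] * 2,
    "feature1": [10, 12, np.nan, np.nan, np.nan, np.nan,
                 22, 24, np.nan, np.nan, np.nan,
                 12, np.nan],
    "feature2": [20, 25, np.nan, np.nan, np.nan, np.nan,
                 16, 14, np.nan, np.nan, np.nan,
                 15, np.nan],
    "target1": [5, 6, 4, 7, 3, 5, 7, 3, 5, 6, 9, 6, 1],
})

cols = ["feature1", "feature2"]  # assumed name for the columns to fill

# The group mean ignores NaN, and transform returns one value per original
# row, so filled and missing rows alike receive the group's mean.
df[cols] = df.groupby(["Year", "ID"])[cols].transform("mean")

print(df)
```

Note that this matches "average of the first two entries" only because each group has at most two filled rows. For ID == 3, the mean of the single value is that value, which gives the same result as ffill.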

CodePudding user response:

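Use groupby(...).transform to update the values of the feature1 and feature2 columns:

# Select the columns with a list (double brackets); selecting them with a
# bare tuple, ['feature1', 'feature2'], raises an error in pandas 2.0+.
df.update(df.groupby(['Year', 'ID'])[['feature1', 'feature2']].transform('mean'))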
print(df)

# Output:
    Year  ID  feature1  feature2  target1
0   2008   1      11.0      22.5        5
1   2008   1      11.0      22.5        6
2   2008   1      11.0      22.5        4
3   2008   1      11.0      22.5        7
4   2008   1      11.0      22.5        3
5   2008   1      11.0      22.5        5
6   2008   2      23.0      15.0        7
7   2008   2      23.0      15.0        3
8   2008   2      23.0      15.0        5
9   2008   2      23.0      15.0        6
10  2008   2      23.0      15.0        9
11  2008   3      12.0      15.0        6
12  2008   3      12.0      15.0        1
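
Both answers rely on every group having at most two filled rows. If a group could contain more filled rows and you strictly want the average of the first two, one possible variation is sketched below. The helper name `first_two_mean` and the extra value 100 are hypothetical, added only to show the difference; they do not come from the question.

```python
import numpy as np
import pandas as pd

def first_two_mean(s: pd.Series) -> float:
    """Hypothetical helper: mean of the first two non-NaN values in s."""
    return s.dropna().head(2).mean()

# Made-up data: ID 1 has a third filled value (100) that should be ignored.
df = pd.DataFrame({
    "Year": [2008] * 6,
    "ID": [1, 1, 1, 1, 3, 3],
    "feature1": [10, 12, 100, np.nan, 12, np.nan],
    "target1": [5, 6, 4, 7, 6, 1],
})

# transform broadcasts the scalar returned for each group to all of its rows.
df["feature1"] = df.groupby(["Year", "ID"])["feature1"].transform(first_two_mean)

print(df)
```

Here ID 1 gets 11.0 (the mean of 10 and 12, ignoring 100) and ID 3 gets 12.0. A Python-level function is slower than the built-in 'mean', so this is only worth using when the plain group mean would give the wrong answer.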