With Python 3.10
Sample data:
import pandas as pd
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
print(df)
df = df.mask(df == 0).fillna(df.mean())
print(df) # <---this works but you will see what I mean about looking off..
Updated Solution:
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLS with a value, (sometimes it fits well, sometimes, not so much).
The idea is I have one or more column(s) with bad or missing data.
If I use .fillna(df.mean()) for this it sticks out like a sore thumb.
My Goal is to have a percentage of the total number of elements in the dataframe column to make the new mean from...
I would like to take a len(df)*0.30 (30%) and use divide it in half.
I would collect half the numbers above the index point where the (null/0/bad data) data exists.
I would collect half the numbers below the index where the
These collected elements would be the be used to calculate the missing or bad index point.
This would be more helpful if there were a data set that irregular or had missing bad data
CodePudding user response:
You can take a rolling mean with min periods = 1 to smooth out the data. or you can do a variant of this method to customise what you want.
Inside the lambda
i used this np.nanmean(x)
.
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])
print(df)
result:
column data something ma final
0 1 14890.0 3 14890.000000 14890.0
1 4 5.0 6 7447.500000 5.0
2 7 8.0 9 4967.666667 8.0
3 11 13.0 14 8.666667 13.0
4 12 0.0 18 7.000000 0.0
5 87 NaN 54 6.500000 6.5
6 1 0.0 3 0.000000 0.0
7 4 5.0 6 2.500000 5.0
8 7 8.0 9 4.333333 8.0
9 11 13.0 14 8.666667 13.0
10 12 0.0 18 7.000000 0.0
11 87 NaN 54 6.500000 6.5
12 1 0.0 3 0.000000 0.0
13 4 5.0 6 2.500000 5.0
14 7 8.0 9 4.333333 8.0
15 11 13.0 14 8.666667 13.0
16 12 0.0 18 7.000000 0.0
17 87 10026.0 54 3346.333333 10026.0