I have a 2D NumPy array with missing data, and I want to fill the gaps so that the values change smoothly across the array. I have something like this:
[[72829],
[nan],
[73196],
[73087],
[nan],
[nan],
[72294.5]]
I want to fill those empty cells with the mean of the closest cells, to get something like this:
[[72829],
[73012.5],
[73196],
[73087],
[72888.875],
[72492.625],
[72294.5]]
I tried SimpleImputer and KNNImputer from scikit-learn, but all I got was the same value for every missing cell, not the mean of the neighboring cells as described above. This is the code:
for label, column in data.items():  # iteritems() was removed in pandas 2.0
    reshaped = np.array(column.values)  # create a NumPy array for scikit-learn
    reshaped = reshaped.reshape(-1, 1)  # reshape the data into a 2D array
    normalized = imputer.fit_transform(reshaped)  # impute the missing values
    data[label] = normalized.ravel()  # replace the column with the imputed values
With KNNImputer, I got something like this (which is not what I want):
[[72829],
[68088.71106114],
[73196],
[73087],
[68088.71106114],
[68088.71106114],
[72294.5]]
Does anyone know an idea or algorithm that could give this kind of "uniformity" to the array values? The goal is to be able to plot graphs without missing data. A solution based on pandas, NumPy, or scikit-learn would be preferable. Thanks.
CodePudding user response:
Convert the data to a DataFrame and average the b(ackward) fill and the f(orward) fill:
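import numpy as np
import pandas as pd

x = [[72829],
     [np.nan],
     [73196],
     [73087],
     [np.nan],
     [np.nan],
     [72294.5]]
df = pd.DataFrame(x)
# average the next known value (bfill) and the previous known value (ffill)
df = (df[0].bfill() + df[0].ffill()) / 2
df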
>>>
0 72829.00
1 73012.50
2 73196.00
3 73087.00
4 72690.75
5 72690.75
6 72294.50
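Two things to keep in mind with this approach: consecutive gaps all receive the same value (rows 4 and 5 above), and a gap at the very start or end of the column has a known value on one side only, so the sum contains NaN and the cell stays empty. A minimal sketch of one way to handle the edges, assuming you are happy to copy the single available neighbor there (the sample values and the name `filled` are only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 72829, np.nan, 73196, np.nan])

back = s.bfill()  # next known value (NaN after the last known value)
fwd = s.ffill()   # previous known value (NaN before the first known value)

# Mean of both neighbors where both exist; otherwise fall back to
# whichever neighbor is available
filled = ((back + fwd) / 2).fillna(back).fillna(fwd)
print(filled)
```

Only the middle gap is averaged here; the first and last cells simply take their nearest known value.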
CodePudding user response:
In[0]:
import pandas as pd
series = pd.Series([72829,
None,
73196,
73087,
None,
None,
72294.5])
series.interpolate(method='linear')
Out[0]:
0 72829.000000
1 73012.500000
2 73196.000000
3 73087.000000
4 72822.833333
5 72558.666667
6 72294.500000
dtype: float64
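Since the question starts from a 2D NumPy array, here is a rough sketch of the same linear interpolation applied column by column to the whole array. The helper name `interpolate_column` is made up for this example, and the NumPy variant assumes every column has at least one known value:

```python
import numpy as np
import pandas as pd

arr = np.array([[72829], [np.nan], [73196], [73087],
                [np.nan], [np.nan], [72294.5]])

# pandas route: interpolate down each column (axis=0), then back to NumPy
filled = pd.DataFrame(arr).interpolate(method="linear", axis=0).to_numpy()

# NumPy-only route using np.interp on the row positions
def interpolate_column(col):
    """Linearly interpolate the NaNs of a 1D array (hypothetical helper).

    Assumes the column contains at least one non-NaN value.
    """
    col = np.asarray(col, dtype=float)
    idx = np.arange(col.size)
    known = ~np.isnan(col)
    # np.interp repeats the nearest known value for gaps at the start or end
    return np.interp(idx, idx[known], col[known])

filled_np = np.column_stack(
    [interpolate_column(arr[:, j]) for j in range(arr.shape[1])]
)
print(filled_np)
```

Both routes give the same values as the Series example above for this data, and the result keeps the original 2D shape, so it can be plotted directly.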