I have unemployment data for 30 countries with some missing values. In the Excel sheet all the numbers are stored as strings, so I first convert them to floats, and then, wherever a value is missing, I want to replace it with its column's mean. The code runs without any error, but when I print the data I still have the null values.
data=pd.read_excel(r'C:\Users\OĞUZ\Desktop\employment.xlsx')
data=data.set_index('Unnamed: 0')
for column in data:
    for row in column:
        if len(row)>5:
            row=float(row)
        if row.isnull():
            row=column.mean()
print(data['Argentina'].head())
This is what I get after print.
Unnamed: 0
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
Name: Argentina, dtype: float64
CodePudding user response:
To fill NaNs use df.fillna(value). For the mean use df.mean(). If your column is named Argentina, this could look like below:
df.Argentina.fillna(df.Argentina.mean(), inplace=True)
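inplace=True modifies the column in place rather than returning a new Series. In older pandas versions the line is equivalent to the reassignment below. With Copy-on-Write enabled (the default from pandas 3.0), an in-place call on a selected column may no longer update df, so the reassignment is the safer form: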
df.Argentina = df.Argentina.fillna(df.Argentina.mean())
Example
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'Argentina':[1,np.nan,2,4]}, index=[1990, 1991, 1992, 1993])
>>> df
      Argentina
1990        1.0
1991        NaN
1992        2.0
1993        4.0
>>> df.Argentina = df.Argentina.fillna(df.Argentina.mean())
>>> df
      Argentina
1990   1.000000
1991   2.333333
1992   2.000000
1993   4.000000
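The same idea extends to every column at once. Below is a minimal sketch, assuming all columns hold numbers stored as strings, as described in the question. The country names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: numbers stored as strings, with missing entries.
df = pd.DataFrame(
    {
        "Argentina": ["1.0", np.nan, "2.0", "4.0"],
        "Brazil": ["3.0", "5.0", np.nan, "7.0"],
    },
    index=[1990, 1991, 1992, 1993],
)

# Convert every column to numeric; unparseable entries become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# DataFrame.mean() returns one mean per column, and fillna aligns
# that Series on the column labels, so each column gets its own mean.
filled = numeric.fillna(numeric.mean())
print(filled)
```

Because fillna returns a new DataFrame here, the result has to be assigned to a name (filled in this sketch) to be kept.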
CodePudding user response:
You can either iterate over the columns, or use DataFrame.transform or DataFrame.apply.
Whichever approach you use, you'll want to:
- Convert column values from strings to floats
- Calculate the mean of the column
- Use Series.fillna to fill the NaN values with the previously calculated value
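As a rough sketch of the first option (iterating over the columns), the three steps could look like this. The small frame below is hypothetical and only stands in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: numeric strings with some missing values.
df = pd.DataFrame({
    "a": ["1", "2", np.nan, "3"],
    "b": ["10", np.nan, np.nan, "20"],
})

for name in df.columns:
    col = pd.to_numeric(df[name])  # 1. convert strings to floats
    avg = col.mean()               # 2. column mean (NaN is skipped by default)
    df[name] = col.fillna(avg)     # 3. fill NaN and write the column back

print(df)
```

Assigning back with df[name] = ... is what makes the change stick. Iterating a DataFrame directly yields column labels, and rebinding a loop variable, as in the question, leaves the DataFrame untouched.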
Create Data
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(5, size=10),
    "b": rng.integers(5, 10, size=10),
    "c": rng.integers(10, 15, size=10)
}).astype(str)
df.loc[2:5, :] = np.nan
# note all the numbers you see are actually strings
print(df)
     a    b    c
0    4    8   11
1    3    9   14
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
5  NaN  NaN  NaN
6    0    8   12
7    0    7   10
8    0    7   13
9    4    9   13
Solution - DataFrame transform
def clean_column(series):
    series = pd.to_numeric(series, downcast="float")
    avg = series.mean()
    return series.fillna(avg)
new_df = df.transform(clean_column)
print(new_df)
          a    b          c
0  4.000000  8.0  11.000000
1  3.000000  9.0  14.000000
2  1.833333  8.0  12.166667
3  1.833333  8.0  12.166667
4  1.833333  8.0  12.166667
5  1.833333  8.0  12.166667
6  0.000000  8.0  12.000000
7  0.000000  7.0  10.000000
8  0.000000  7.0  13.000000
9  4.000000  9.0  13.000000
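DataFrame.apply, mentioned above, can be used the same way for this column-wise cleanup. The sketch below is a variant under two assumptions that are not in the original answer: the data may contain non-numeric placeholders such as "n/a", and the default float64 is preferred over the float32 that downcast="float" may produce. The sample frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: numeric strings, a placeholder string, and NaN.
df = pd.DataFrame({
    "a": ["4", "3", np.nan, "0"],
    "b": ["8", "n/a", np.nan, "7"],
})

def clean_column(series):
    # errors="coerce" turns strings that cannot be parsed into NaN
    # instead of raising an error.
    series = pd.to_numeric(series, errors="coerce")
    return series.fillna(series.mean())

# apply calls clean_column once per column (axis=0 is the default).
new_df = df.apply(clean_column)
print(new_df)
print(new_df.dtypes)
```

With errors="coerce", a stray text entry is treated like a missing value and filled with the column mean. Whether that is desirable depends on the data, so it may be worth inspecting such entries before coercing them.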