I am new to machine learning and trying to learn. I have run into the following problem. I have a dataframe that contains these columns:
ispublicaccount, country_AE, country_SA, age
The ispublicaccount column contains some missing data, and the age column (the one whose missing values I want to fill in) also contains missing data.
How can I create a machine learning model to predict the missing age values?
I have the original dataframe: df_simplified
ispublicaccount  country_AE  country_SA  age
1                1           0           41
2                1           0           NaN
1                0           1           NaN
NaN              1           0           23
0                0           1           31
1                0           1           NaN
1                0           1           19
2                1           0           24
.....
Of course there is a lot more data, but this is it in a nutshell.
I know how to build a model when the dataset is complete: with no missing values, I would create a model, fit it to the data, and predict. But how can I deal with the missing data here and predict the missing age values? Thank you for your help.
CodePudding user response:
Dealing with missing values is a big issue in ML. If you have enough data, you could delete the rows with missing values. If you don't, you could replace the missing values with the last valid observation, or with the mean or median of the column. I think that if you can't delete the rows with missing values, you should consider using the mean in place of the NaNs. To drop rows, you could use this:
data = data.dropna(subset=['age'])
And to replace the missing values with the mean, use this:
data['age'] = data['age'].fillna(data['age'].mean())
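If you would rather predict the missing ages with a model, as the question asks, one possible approach is to train only on the rows where age is known and then predict for the rows where it is missing. The following is a minimal sketch, not a definitive solution. It assumes scikit-learn is installed, rebuilds only the eight sample rows shown in the question, fills the gaps in ispublicaccount with its most frequent value, and uses a plain linear regression. The choice of imputation strategy, the choice of model, and the names features, known, model and df_filled are my own assumptions; with real data you would want to compare several models and validate them on held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Sample rows copied from the question (the real dataframe has many more).
df_simplified = pd.DataFrame({
    "ispublicaccount": [1, 2, 1, np.nan, 0, 1, 1, 2],
    "country_AE": [1, 1, 0, 1, 0, 0, 0, 1],
    "country_SA": [0, 0, 1, 0, 1, 1, 1, 0],
    "age": [41, np.nan, np.nan, 23, 31, np.nan, 19, 24],
})

features = ["ispublicaccount", "country_AE", "country_SA"]

# Boolean mask: True where age is known (training rows),
# False where age is missing (rows to predict).
known = df_simplified["age"].notna()

# The pipeline first fills missing feature values with the most frequent
# value seen in the training rows, then fits a linear regression.
# Both choices are assumptions made for illustration.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    LinearRegression(),
)
model.fit(df_simplified.loc[known, features], df_simplified.loc[known, "age"])

# Predict age only for the rows where it was missing; known ages are kept.
df_filled = df_simplified.copy()
df_filled.loc[~known, "age"] = model.predict(df_simplified.loc[~known, features])

print(df_filled)
```

Putting the imputer inside the pipeline means the same fill value learned from the training rows is reused at prediction time, so the rows with a missing ispublicaccount are handled consistently in both steps.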