Given a data set, such as:
(FirstName, LastName, Sex, DateOfBirth, HairColor, EyeColor, Height, Weight, Location)
that some model can train on, what kind of Machine Learning paradigm can be used to predict missing values if only given some of them?
Example:
Given:
(FirstName: John, LastName: Doe, Sex: M, Height: (5,10))
What model could predict the missing values?
(DateOfBirth, HairColor, EyeColor, Weight, Location)
In other words, the model should be able to take any of the fields as inputs, and "fill in" any that are missing
... and what type of ML/DL is this even called?
CodePudding user response:
If you're looking to fill missing values with an algorithm, this is called imputing missing data. If you're using Python, the scikit-learn library has a number of imputation algorithms that you can explore in the docs.
One nice algorithm is KNNImputer, which looks n_neighbors
most similar observations to the current observation and fills the missing data with mean for the column from those similar observations. Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
CodePudding user response:
If there are a lot of missing values in a row, first need to understand: will it add value to my problem? Else drop such rows which have a lot of missing values.
One way to handle: Remove the target variable. Using the features which do not have missing values, predict columns that have missing values. Use ML algorithms, to predict and fill those values. Then again use previously imputed missing values to predict other missing values.
Eg: if features and target are: X1, X2, X3, X4, Y Let X1 and X2 do not have missing values, X3 and X4 have missing values. First, keep aside Y. Using X1 and X2, fill missing values in X3 with the help of ML algorithms. Again, using X1, X2, X3 fill missing values in X4. Then finally predict the target values (Y).
I have used this method in hackathons and got good results. Before applying this, first, try to get a good understanding of the data. The approach might be slightly different from what you have asked, but this is a decent approach for such problems.