Missing Values in Numeric Columns

Time:04-12

I have a dataframe with an age column.

However, some values are missing. For now, I replaced them with the most frequent value in each column, like this:

df_processed = df_processed.apply(lambda x: x.fillna(x.value_counts().index[0]))

but I want to replace them with an 'unknown' category instead. However, it feels odd to put the text 'unknown' in a numeric categorical column. What should I replace the missing values with? I want a NUMERIC 'unknown' category. I have heard that 0 is a bad idea for age.

CodePudding user response:

In Python, there are:

  • None, the sole value of type NoneType
  • NaN, a float value that means "Not a Number"

I would prefer the second one in a pandas/numpy context, mainly because the column keeps a numeric (float) dtype, so arithmetic and comparisons still work on it.
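To illustrate the point about NaN, here is a minimal sketch with made-up ages (the values and the variable name `ages` are assumptions, not the asker's data). It shows that the column stays numeric, that pandas skips NaN in aggregations by default, and that any comparison involving NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; np.nan marks the missing entries.
ages = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0], name="age")

print(ages.dtype)            # the column keeps a float dtype
print(ages.isna().sum())     # number of missing values
print(ages.mean())           # NaN is skipped by default (skipna=True)
print((ages > 30).tolist())  # comparisons involving NaN evaluate to False
```

Because NaN never compares as greater or smaller than anything, rows with a missing age simply drop out of filters like `ages > 30` instead of being treated as a real age.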

CodePudding user response:

You are probably better off imputing the missing values.

# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls:
# data.dropna()
# You can fill nulls with zeros:
# data.fillna(0)
# You can also fill with the mean or median, or do a forward-fill or back-fill.
# The problem with all of these options is that if you have a lot of missing values for one specific feature,
# you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using a machine learning technique
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

# Load data
data = pd.read_csv('C:\\titanic.csv')
print(data)
list(data)
data.dtypes

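# Now, we will use a simple regression technique to predict the missing values.
# .copy() makes this an independent dataframe, so assigning to it later
# does not trigger pandas' SettingWithCopyWarning.
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']].copy()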

data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]

linreg.fit(train_data_x,train_data_y)

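# Predict age for every row from the five predictor columns
# (this assumes those columns contain no nulls; otherwise predict() raises an error)
test_data = data_with_null.iloc[:,:5]
age = pd.DataFrame(linreg.predict(test_data))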

# check for nulls
data_with_null.apply(lambda x: sum(x.isnull()),axis=0)

# Find missing per feature
print(data_with_null.isnull().sum())

# Find any/all missing data points in entire data set
print(data_with_null.isnull().sum().sum())

# View the predicted ages (one prediction per row, including rows where age is known)
age = list(linreg.predict(test_data))
print(age)

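# Finally, we will write the predicted values back into the 'data_with_null' dataframe,
# but only for the rows where age is missing, so that known ages are not overwritten
missing_age = data_with_null['age'].isnull()
data_with_null.loc[missing_age, 'age'] = pd.Series(age, index=data_with_null.index)[missing_age]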

# Check for nulls again (print so the result also shows up outside a notebook)
print(data_with_null.isnull().sum())
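
Since the Titanic CSV is not included here, the same regression-based imputation can be sketched on synthetic data. Everything in this block (the column values, the 40 knocked-out ages, the names `df`, `predictors`, `is_missing`) is an assumption made for illustration, not the original dataset. The model is trained on rows with a known age and used to fill only the rows where age is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a few Titanic-like columns (hypothetical values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "sibsp": rng.integers(0, 3, size=n),
    "fare": rng.uniform(5, 100, size=n),
})
df["age"] = 50 - 8 * df["pclass"] + 0.1 * df["fare"] + rng.normal(0, 3, size=n)

# Keep the true ages for comparison, then blank out 40 of them.
true_age = df["age"].copy()
missing_idx = rng.choice(n, size=40, replace=False)
df.loc[missing_idx, "age"] = np.nan

predictors = ["pclass", "sibsp", "fare"]
is_missing = df["age"].isna()

# Train only on rows where age is known.
model = LinearRegression()
model.fit(df.loc[~is_missing, predictors], df.loc[~is_missing, "age"])

# Fill only the rows where age is missing; known ages are left untouched.
df.loc[is_missing, "age"] = model.predict(df.loc[is_missing, predictors])

print(df["age"].isna().sum())  # no missing ages remain
```

Predicting only for the missing rows avoids replacing real observations with model estimates, and it also means rows with known ages never need complete predictor values at prediction time.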

https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna with Predicted Values.ipynb
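
For comparison, the simpler options mentioned in the comments at the top of this answer (dropping rows, filling with zero, mean, or median, and forward/back-fill) can be sketched as follows. The toy dataframe is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data; np.nan marks the missing ages.
data = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 58.0]})

dropped = data.dropna()                     # drop rows that contain nulls
zero_filled = data.fillna(0)                # fill nulls with zeros
mean_filled = data.fillna(data.mean())      # fill with the column mean
median_filled = data.fillna(data.median())  # fill with the column median
forward_filled = data.ffill()               # carry the last known value forward
back_filled = data.bfill()                  # pull the next known value backward

print(median_filled["age"].tolist())
```

Each of these returns a new dataframe and leaves `data` unchanged, so the variants can be compared side by side before choosing one.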
