I am using the IMDB dataset for machine learning, and it contains a lot of missing values which are entered as '\N'. Specifically in the StartYear column which contains the movie year release I want to convert the values to integers. Which im not able to do right now, I could drop these values but I wanted to see why they're missing first. I tried several things but no success.
This is my latest attempt:
CodePudding user response:
Here is a way to do it without using replace
:
import pandas as pd
import numpy as np
df_basics = pd.DataFrame({'startYear':['\\N']*78760 [2017]*18267 [2018]*18263 [2016]*17837 [2019]*17769 ['1996 ','1993 ','2000 ','2019 ','2029 ']})
print(pd.value_counts(df_basics.startYear))
df_basics.loc[df_basics.startYear == '\\N','startYear'] = np.NaN
print(pd.value_counts(df_basics.startYear, dropna=False))
Output:
NaN 78760
2017 18267
2018 18263
2016 17837
2019 17769
1996 1
1993 1
2000 1
2019 1
2029 1