How to handle data if majority equals zero? Data cleaning

Time:10-11

I'm a beginner exploring the TMDB 10000 Movie Dataset. I found the following for the budget and revenue columns:

b_0 = df[df['budget']==0].shape[0]/df.shape[0]*100
print('percentage of zero budget movies: ',b_0,'%')

percentage of zero budget movies: 52.425218591808566 %

b_r_0 = df[(df['revenue']==0) & (df['budget']==0)].shape[0]/df.shape[0]*100
print('percentage of zero revenue and budget movies: ',b_r_0,'%')

percentage of zero revenue and budget movies: 43.26737229636448 %

r_0 = df[df['revenue']==0].shape[0]/df.shape[0]*100
print('percentage of zero revenue movies: ',r_0,'%')

percentage of zero revenue movies: 55.37045559134837 %

I know for sure that a movie's budget or revenue cannot really be zero, so these zeros must stand for missing values. The statistics I calculated (mean, median, quartiles) are biased by the zeros, so I can't use them for replacement, and I can't drop over 40% of the data. How can I fix this?

data source: https://www.google.com/url?q=https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv&sa=D&ust=1532469042115000

CodePudding user response:

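To find the best solution, you need to understand where the data comes from in the real world and what a zero actually means there.

Filling with the mean or median is usually a reasonable choice; the median is less affected by a few very large values.

First replace the zeros with null (NaN), then fill the nulls with either the median or the mean. That way the statistic is computed from the non-zero values only and is no longer biased by the zeros.

If you need the code to do that, let me know.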
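
The replace-then-fill approach can be sketched roughly as follows. This is only a minimal illustration on a small made-up DataFrame, not the real TMDB file; the names `cols`, `cleaned`, `medians` and `filled` are placeholders chosen for the example, and the median is used as one possible fill value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the TMDB data; the values are invented for illustration.
df = pd.DataFrame({
    "budget":  [0, 100, 0, 300, 200],
    "revenue": [0, 250, 0, 0, 900],
})

cols = ["budget", "revenue"]

# Step 1: treat 0 as "not recorded" by turning it into NaN.
cleaned = df.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# Step 2: compute the median of each column; NaN values are skipped by default,
# so the zeros no longer pull the statistic down.
medians = cleaned[cols].median()

# Step 3: fill the missing entries with the per-column median.
filled = cleaned.fillna(medians)
print(filled)
```

With more than half of the rows missing, filling them all with one value will flatten the distribution, so it may be worth keeping the NaN version (`cleaned` here) for descriptive statistics and only filling when a later step cannot handle missing values.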
