Calculate the mean in pandas while a column has a string

Time:11-01

I am currently learning pandas and am using an IMDb movies dataset in which one of the columns is the duration of the movies. However, one of the values is the string "None", so I can't calculate the mean. I thought of changing "None" to 0, but that would skew the results, as can be seen with the code below.

dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()

Any ideas on what I should do to avoid skewing the data? I also graphed it, and the plot makes the skew even clearer.

CodePudding user response:

You can replace "None" with numpy.nan instead of using 0.

Something like this should do the trick:

import numpy as np
# NaN marks the value as missing; mean() skips NaN by default
dur_temp = duration.replace("None", np.nan).astype(float)
descricao_duration = dur_temp.mean()
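
To make the difference concrete, here is a minimal sketch comparing the two replacements. The sample values are made up for illustration and only stand in for the real duration column, which the question does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the IMDb duration column
# (assumption: values are stored as strings, with "None" as a placeholder).
duration = pd.Series(["90", "120", "None", "150"], dtype=object)

# Replacing "None" with 0 counts the missing film as a 0-minute film.
mean_with_zero = duration.replace("None", 0).astype(float).mean()

# Replacing "None" with NaN lets mean() skip the missing entry.
mean_with_nan = duration.replace("None", np.nan).astype(float).mean()

print(mean_with_zero)  # 90.0
print(mean_with_nan)   # 120.0
```

With NaN, the mean is taken over the three known durations only, so the missing entry no longer pulls the result down.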

CodePudding user response:

If you want it to work for any string in your pandas Series, you could use pd.to_numeric:

pd.to_numeric(dur_temp, errors='coerce').mean()

This way, every value that cannot be converted to a number is replaced by NaN, whatever it is.
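
A short sketch of this approach on a made-up column, assuming the data mixes numbers, numeric strings, and more than one kind of placeholder (the "N/A" entry is hypothetical and not from the question):

```python
import pandas as pd

# Hypothetical column mixing numbers, numeric strings, and placeholders.
duration = pd.Series([90, "120", "None", "N/A", 150], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(duration, errors="coerce")

print(numeric.isna().sum())  # 2
print(numeric.mean())        # 120.0
```

Checking isna().sum() before taking the mean is a cheap way to see how many values were coerced, in case more entries were unparseable than expected.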

CodePudding user response:

Just filter out the rows by condition, like this:

df[df['a'] != 'None']  # assuming the values to average are in column a
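
Filtering alone only returns the remaining rows; the column still has to be converted to a numeric type before averaging. A minimal sketch, assuming a hypothetical DataFrame with the values stored as strings in column a:

```python
import pandas as pd

# Hypothetical data; column "a" holds durations as strings plus a "None" placeholder.
df = pd.DataFrame({"a": ["90", "None", "120", "150"]}, dtype=object)

# Keep only the rows whose value is not the string "None".
kept = df[df["a"] != "None"]

# The remaining strings still need converting before mean() can be used.
mean_a = kept["a"].astype(float).mean()

print(mean_a)  # 120.0
```

Note that this drops the whole row, so any other columns lose those rows as well; replacing the placeholder with NaN keeps the rows intact.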

CodePudding user response:

Make them np.nan values.

I am writing this as an answer because I can't comment:

df = df.replace('None', np.nan)

or

df.replace('None', np.nan, inplace=True)

CodePudding user response:

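If the missing values are real None/NaN entries rather than the string "None", you can use fillna(value=np.nan) as shown below; mean() then skips the NaN values: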

descricao_duration = dur_temp.fillna(value=np.nan).mean()

Demo:

import pandas as pd
import numpy as np

dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)

Output:

duration    15.0
dtype: float64