Calculate the mean in pandas while a column has a string

I am currently learning pandas and I am using an IMDb movies database in which one of the columns is the duration of the movies. However, one of the values is the string "None", so I can't calculate the mean. I thought of changing "None" to 0, but that would skew the results, as can be seen with the code below.

dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()

Any ideas on what I should do to avoid skewing the data? I also plotted it, and the plot makes the skew even clearer.

CodePudding user response：

You can replace "None" with numpy.nan instead of using 0.

Something like this should do the trick:

import numpy as np
dur_temp = duration.replace("None", np.nan).astype(float)  # NaN is skipped by mean()
descricao_duration = dur_temp.mean()
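
As a self-contained sketch of this approach, the snippet below compares the two replacements side by side. The Series is a made-up stand-in for the real duration column, and its values (stored as strings, as they might be after loading a file) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real duration column (object dtype, string values).
duration = pd.Series(["90", "120", "None", "105"], dtype=object)

# Replacing "None" with 0 counts the missing movie as a 0-minute movie.
mean_with_zero = duration.replace("None", "0").astype(float).mean()

# Replacing it with NaN lets mean() skip the missing entry instead.
mean_with_nan = duration.replace("None", np.nan).astype(float).mean()

print(mean_with_zero)  # 78.75
print(mean_with_nan)   # 105.0
```

With NaN the mean is taken over the three known durations only, which is why it is not pulled down.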

CodePudding user response：

If you want it to work for any string in your pandas Series, you could use pd.to_numeric:

import pandas as pd
pd.to_numeric(duration, errors='coerce').mean()

This way, every value that cannot be converted to a number is replaced by NaN, regardless of what it is.
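
A minimal sketch of the coercion, assuming a made-up column that contains more than one kind of non-numeric entry (the values are hypothetical):

```python
import pandas as pd

# Hypothetical column with several kinds of non-numeric entries.
duration = pd.Series(["90", "120", "None", "n/a", "105"], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(duration, errors="coerce")

print(numeric.isna().sum())  # 2
print(numeric.mean())        # 105.0
```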

CodePudding user response：

Just filter out the "None" rows with a condition, like this, and compute the mean on what is left:

df[df['a'] != 'None']  # assuming the values to average are in column 'a'
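
A short sketch of the filtering approach carried through to the mean. The frame and the column name 'a' are hypothetical, and the values are assumed to be stored as strings, so they still need converting after the filter:

```python
import pandas as pd

# Hypothetical frame; column "a" holds durations as strings.
df = pd.DataFrame({"a": ["90", "120", "None", "105"]}, dtype=object)

# Keep only the rows whose value is not the string "None".
filtered = df[df["a"] != "None"]

# The remaining values are still strings, so convert before averaging.
mean_a = filtered["a"].astype(float).mean()

print(len(filtered))  # 3
print(mean_a)         # 105.0
```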

CodePudding user response：

Make them np.nan values.

I am writing this as an answer because I can't comment:

df = df.replace('None', np.nan)

or

df.replace('None', np.nan, inplace=True)

CodePudding user response：

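If the missing entries are real None values rather than the string "None", pandas already treats them as NaN and mean() skips them, so fillna(value=np.nan) leaves the result unchanged, as shown below: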

descricao_duration = dur_temp.fillna(value=np.nan).mean()

Demo:

import pandas as pd
import numpy as np

dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)

Output:

duration    15.0
dtype: float64
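
The demo relies on real None objects, which behave differently from the string "None" in the question. A small sketch of the difference, using made-up values:

```python
import pandas as pd

# Real None objects become NaN, and the column is stored as float64.
real_missing = pd.Series([10, 20, None, 15, None])
print(real_missing.dtype)         # float64
print(real_missing.isna().sum())  # 2
print(real_missing.mean())        # 15.0 (NaN is skipped by default)

# The string "None" is ordinary data, so pandas does not see it as missing
# and fillna has nothing to fill.
string_missing = pd.Series([10, 20, "None", 15], dtype=object)
print(string_missing.isna().sum())  # 0
```

When the column holds the string, it has to be converted first, for example with replace or pd.to_numeric as in the other answers.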