Replace value with the average of it's column with Pandas-CodePudding

I have a csv file with multiple columns containing numeric values. However, some rows are missing a value in one or more columns. The missing value is denoted by 'x'.

How would I replace all the missing values ('x') with the average of all values in their column using Pandas?

df=  pd.read_csv('file.csv')

CodePudding user response：

avg = df[column].mean()
df[column] = df[column].apply(lambda v: avg if v=="x" else v)

CodePudding user response：

The first thing to recognize is the columns that have 'x' in them are not integers. They are objects.

df = pd.read_csv('file.csv')

df

    Col1    Col2
0   1   22
1   2   44
2   3   x
3   4   88
4   5   110
5   6   132
6   7   x
7   8   176
8   9   198
9   10  x

df.dtypes

Col1     int64
Col2    object
dtype: object

In order to get the mean of Col2, it needs to be converted to a numeric value.

df['Col2'] = pd.to_numeric(df['Col2'], errors='coerce').astype('Int64')

df.dtypes
Col1    int64
Col2    Int64
dtype: object

The df now looks like so:

df 

Col1    Col2
0   1   22
1   2   44
2   3   <NA>
3   4   88
4   5   110
5   6   132
6   7   <NA>
7   8   176
8   9   198
9   10  <NA>

Now we can use fillna() with df['Col2'].mean():

df['Col2'] = df['Col2'].fillna(df['Col2'].mean())

df
    Col1    Col2
0   1   22
1   2   44
2   3   110
3   4   88
4   5   110
5   6   132
6   7   110
7   8   176
8   9   198
9   10  110

CodePudding user response：

You're going to run into problems taking the mean of columns that have non-numeric values such as 'x'. You'll have to fix that first, such as with df = df.replace('x',np.nan) (this works only if x is the only non-numeric value) and then you can dodf = df.fillna(df.mean()). If you start out with integers, the columns with x in them will be converted to float (even the rows that weren't replaced), so if you don't want that, you'll have to do more work.