I have a csv
file with multiple columns containing numeric values. However, some rows are missing a value in one or more columns. The missing value is denoted by 'x'.
How would I replace all the missing values ('x') with the average of all values in their column using Pandas?
df= pd.read_csv('file.csv')
CodePudding user response:
avg = df[column].mean()
df[column] = df[column].apply(lambda v: avg if v=="x" else v)
CodePudding user response:
The first thing to recognize is the columns that have 'x' in them are not integers. They are objects.
df = pd.read_csv('file.csv')
df
Col1 Col2
0 1 22
1 2 44
2 3 x
3 4 88
4 5 110
5 6 132
6 7 x
7 8 176
8 9 198
9 10 x
df.dtypes
Col1 int64
Col2 object
dtype: object
In order to get the mean of Col2, it needs to be converted to a numeric value.
df['Col2'] = pd.to_numeric(df['Col2'], errors='coerce').astype('Int64')
df.dtypes
Col1 int64
Col2 Int64
dtype: object
The df now looks like so:
df
Col1 Col2
0 1 22
1 2 44
2 3 <NA>
3 4 88
4 5 110
5 6 132
6 7 <NA>
7 8 176
8 9 198
9 10 <NA>
Now we can use fillna() with df['Col2'].mean():
df['Col2'] = df['Col2'].fillna(df['Col2'].mean())
df
Col1 Col2
0 1 22
1 2 44
2 3 110
3 4 88
4 5 110
5 6 132
6 7 110
7 8 176
8 9 198
9 10 110
CodePudding user response:
You're going to run into problems taking the mean of columns that have non-numeric values such as 'x'. You'll have to fix that first, such as with df = df.replace('x',np.nan)
(this works only if x
is the only non-numeric value) and then you can dodf = df.fillna(df.mean())
. If you start out with integers, the columns with x
in them will be converted to float (even the rows that weren't replaced), so if you don't want that, you'll have to do more work.