I have a large dataset with 200 columns. I need to do preprocessing on it and the most of the numerical attributes have random string values in them, 'XX'.
I want to get rid of all the 'XX' in the dataset and replace them with the mean of each corresponding column.
I understand how I would do it for each individual column using .replace and np.mean , but I havent been able to figure out how to apply it to all the columns.
aka I understand:
df['col1'] = df['col1'].replace('XX',
np.mean(pd.to_numeric(df['col1'], errors='coerce')))
but need a generalized version of it that I can apply to all columns that have string values in them
CodePudding user response:
Some test data:
import numpy as np
import pandas as pd
testdata = {'A': ['XX', 0.09, 0.76, 'XX', 0.65, 0.21, 0.18, 0.97, 'XX', 'XX'],
'B': ['XX', 0.46, 0.29, 'XX', 0.44, 0.59, 'XX', 0.47, 0.37, 'XX']}
df = pd.DataFrame(testdata)
>>> df.head()
A B
0 XX XX
1 0.09 0.46
2 0.76 0.29
3 XX XX
4 0.65 0.44
Then you can do:
>>> df = df.replace("XX", np.nan)
>>> df.head()
A B
0 NaN NaN
1 0.09 0.46
2 0.76 0.29
3 NaN NaN
4 0.65 0.44
>>> df.fillna(df.mean()).head()
A B
0 0.476667 0.436667
1 0.090000 0.460000
2 0.760000 0.290000
3 0.476667 0.436667
4 0.650000 0.440000
CodePudding user response:
you can do it in two steps!
- fill the string values with NaN values
(if dtype of the column if 'object' do
df['col'] = df['col'].astype(int/float)
) - then fillna with mean