I am trying to clean and fill in around 300 columns. I have already replaced all empty fields with NaN, and now I want to convert those values to 0 if certain checks pass:
- NaN values must be present in the column.
- No 0 values may already exist in the column.
- If 0 already exists, replace with 0.1 instead. (I am still deciding what to replace with, since 0 already contributes relevant information in that particular column of the dataframe.)
Thus far I have implemented:
def convert(df, col):
    if (df[col].isnull().sum() > 0): #& (df[df[col] != '0'])
    #if (df[df[col] != '0']):
        df[col].replace(np.NaN, '0', inplace = True)

for col in df.columns:
    convert(df, col)
But checking the second condition (no zeroes may already exist in the column) is not working. I tried to implement it (the commented-out part), but it returns the following error:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
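The error comes from combining `&` with `df[df[col] != '0']`, which is a whole filtered DataFrame rather than a single boolean; reducing each check to a scalar with `.any()` avoids it. A minimal corrected sketch of the intended function, assuming the columns are numeric and the missing markers are real `np.nan` values (the sample dataframe here is illustrative):

```python
import numpy as np
import pandas as pd

def convert(df, col):
    # Reduce the element-wise checks to single booleans with .any()
    has_nan = df[col].isnull().any()
    has_zero = df[col].eq(0).any()
    if has_nan and not has_zero:
        df[col] = df[col].fillna(0)      # no existing zeros: fill with 0
    elif has_nan:
        df[col] = df[col].fillna(0.1)    # zeros already present: fill with 0.1

df = pd.DataFrame({"A": [0.0, np.nan, 6.0],
                   "B": [1, 4, 7],
                   "C": [2.0, 5.0, np.nan]})
for col in df.columns:
    convert(df, col)
```

Note the fills are numeric (`0`, not the string `'0'`), so the float64 columns keep their dtype.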
On another note, regarding the field of Data Science: I am not sure whether some of the columns should have their empty fields replaced by the column mean instead of 0. I have features describing weight, dimensions, prices, etc.
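For numeric features such as weight or price, filling with the column mean (or median, which is more robust to outliers) is a common choice. A minimal sketch with a made-up two-column dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [2.0, np.nan, 4.0],
                   "price": [10.0, 30.0, np.nan]})

# Fill each numeric column's NaNs with that column's own mean
df = df.fillna(df.mean(numeric_only=True))
```

`df.mean(numeric_only=True)` returns one mean per column, and `fillna` with a Series fills each column from the matching entry.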
CodePudding user response:
Use a boolean mask.
Suppose the following dataframe:
>>> df
A B C
0 0.0 1 2.0
1 NaN 4 5.0 # <- NaN should be replaced by 0.1
2 6.0 7 NaN # <- NaN should be replaced by 0
m1 = df.isna().any() # Is there a NaN in the column? (not strictly needed)
m2 = df.eq(0).any() # Is there a 0 in columns
# Replace by 0
df.update(df.loc[:, m1 & ~m2].fillna(0))
# Replace by 0.1
df.update(df.loc[:, m1 & m2].fillna(0.1))
Strictly, only the second mask is needed: fillna has no effect on columns that contain no NaN.
Output result:
>>> df
A B C
0 0.0 1 2.0
1 0.1 4 5.0
2 6.0 7 0.0