Handling missing values in a dataframe with string and numeric columns

Time:02-10

I have a dataframe with some string columns and some numeric columns, and I want to handle the missing values: each NaN should be replaced with the mean of its row. I have looked at other questions on this site, but they differ from mine, for example: Pandas Dataframe: Replacing NaN with row average

If all the numeric values in a row are NaN, I want to delete that row. Here is a sample case:

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]

The desired output is:

df = pd.DataFrame()
df['id'] = ['a', 1, 'n']
df['md'] = ['d', 6, 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7,4]
df

Note: I have used the following code, but it is very slow and takes a long time to run on a big dataframe.

# Numeric columns start at "c1"; everything before it is an identifier.
index_colum = df.columns.get_loc("c1")
df_withno_id = df.iloc[:, index_colum:]

# Drop the rows whose numeric columns are all NaN.
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(index=rowsidx_with_all_NaN)

# Fill the remaining NaN values row by row (this loop is the slow part).
for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        mean = np.nanmean(cols.to_numpy(dtype=float))
        df.loc[i] = df.loc[i].replace(np.nan, mean)

Can anybody help me with this? Thanks.

CodePudding user response:

First, select only the float columns. Second, drop the rows in which all of these columns are NaN. Finally, transpose the dataframe (float columns only), fill each NaN with the column mean of the transposed frame, which is the row mean of the original, and transpose back.

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]

numeric_cols = df.select_dtypes(include='float64').columns
df.dropna(how = 'all', subset = numeric_cols, inplace = True)
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
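
As an alternative, here is a minimal sketch that avoids the double transpose by computing the row means once and filling each column from them. The helper name fill_nan_with_row_mean is hypothetical (it is not part of pandas), and the sketch assumes the columns to fill are exactly the float64 columns, as in the sample above.

```python
import numpy as np
import pandas as pd


def fill_nan_with_row_mean(df, cols):
    """Hypothetical helper: drop rows that are all-NaN in `cols`,
    then replace each remaining NaN with the mean of its row."""
    cols = list(cols)
    # Work on a copy so the caller's dataframe is left untouched.
    out = df.dropna(how="all", subset=cols).copy()
    # Row-wise mean over the selected columns; NaN is skipped by default.
    row_means = out[cols].mean(axis=1)
    # Series.fillna with a Series aligns on the index, so every gap
    # in a column receives the mean of the row it belongs to.
    out[cols] = out[cols].apply(lambda col: col.fillna(row_means))
    return out


df = pd.DataFrame({
    "id": ["a", "b", "c", "n"],
    "md": ["d", "e", "f", "l"],
    "c1": [2, np.nan, np.nan, 5],
    "c2": [0, 5, np.nan, 3],
    "c3": [8, 7, np.nan, np.nan],
})
numeric_cols = df.select_dtypes(include="float64").columns
result = fill_nan_with_row_mean(df, numeric_cols)
print(result)
```

Both versions are vectorized, so neither needs the row-by-row loop from the question. This one keeps the string columns out of the computation and leaves the original df unchanged.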