Filling missing data based on the variable data distribution

Time:10-11

I am trying to write a for loop that fills in missing values across 50 variables. The logic is as follows. If a variable (cols) satisfies mode > median > mean or mode < median < mean (i.e. it is skewed), its missing values should be filled with the median of the variable. If mode = median = mean (i.e. normal distribution), its missing values should be filled with the mean of the variable. If the variable fulfils neither condition, its missing values are filled with the median.

I keep getting the following error: 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().'

I have a rough understanding of the error but am unsure how to solve the problem. I started with plain if statements on the pandas results but still got an error. My code is pasted below. Many thanks in advance for your help!

Approach 1

  #filling data based on the variable distribution

for cols in num_cols2:
    if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
        df[cols]=df[cols].fillna(df.median())
    elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):
        df[cols]=df[cols].fillna(df.mean().iloc[0])
    else:
        df[cols]=df[cols].fillna(df.median())

Error message below

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/admin/Library/CloudStorage/OneDrive-Personal/DA Material/Data Science 6/EDAPipeDetectionleak.ipynb Cell 34 in <cell line: 3>()
      1 #filling data based on distribution
      3 for cols in num_cols2:
----> 4     if ((df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())) | ((df[cols].mean() > df[cols].median()) & (df[cols].median() > df[cols].mode())):
      5         df[cols]=df[cols].fillna(df.median())
      6     elif ((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode())):

File /opt/homebrew/lib/python3.10/site-packages/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
   1525 @final
   1526 def __nonzero__(self):
-> 1527     raise ValueError(
   1528         f"The truth value of a {type(self).__name__} is ambiguous. "
   1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1530     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I also tried the following approaches:-

Approach 2 (raises the same error as above)

for cols in num_cols2:
    df[cols] = df[cols].apply(lambda cols:(df[cols].fillna(df.median()))) if (df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode()) else (df[cols].fillna(df.mean()))

Approach 3

for cols in num_cols2:
    df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])
for cols in num_cols2:
    df[cols] = df.loc[(df[cols].mean() < df[cols].median()) & (df[cols].median() < df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[df[cols].mean() > df[cols].median() & (df[cols].median() > df[cols].mode())] = df[cols].fillna(df.median())
    df[cols] = df.loc[((df[cols].mean() == df[cols].median()) & (df[cols].median() == df[cols].mode()))] = df[cols].fillna(df.mean().iloc[0])

Error output for approach 3 is shown below

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

CodePudding user response:

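mean() and median() return scalars, so combine the comparisons with and / or instead of & / |. Series.mode() returns a Series (a column can have several modes), which is what makes the if condition ambiguous, so take its first value with .iat[0]: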

for col in num_cols2:
    avg = df[col].mean()
    med = df[col].median()
    mod = df[col].mode().iat[0]

    if (avg == med) and (med == mod):
        df[col] = df[col].fillna(avg)
    else:
        df[col] = df[col].fillna(med)
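
To see where the ValueError comes from, here is a minimal sketch on a made-up Series (the values and variable names are illustrative assumptions, not taken from the question's data). It shows that mode() returns a Series, that comparing a scalar with it gives a boolean Series that an if statement cannot evaluate, and that picking the first mode with .iat[0] gives an ordinary scalar comparison.

```python
import pandas as pd

# Hypothetical sample column with one missing value.
s = pd.Series([1.0, 2.0, 2.0, 3.0, None])

mode = s.mode()                  # a Series: ties can produce several modes
comparison = s.median() < mode   # element-wise result: a boolean Series

raised = False
try:
    if comparison:               # pandas refuses to reduce a Series to True/False
        pass
except ValueError as exc:
    raised = True
    print(exc)

first_mode = s.mode().iat[0]     # scalar: the first (smallest) mode
scalar_result = s.median() < first_mode
print(type(mode).__name__, first_mode, scalar_result)
```

Because mode() sorts its result, .iat[0] picks the smallest mode when there are ties; whether that is the right choice depends on the data.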

However, whenever the if condition in the loop holds, avg equals med, so both branches fill with the same value. You can therefore simplify the solution by filling every column's missing values with its median:

df[num_cols2] = df[num_cols2].fillna(df[num_cols2].median())
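
As a quick sanity check, the sketch below compares the loop with the one-line median fill on a small made-up DataFrame. The column names "a" and "b" and their values are assumptions chosen so that one column is symmetric and the other is skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has mean == median == mode, "b" is right-skewed.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 3.0, np.nan],
    "b": [1.0, 1.0, 2.0, 10.0, np.nan],
})
num_cols2 = ["a", "b"]

# Loop version: mean if all three statistics agree, otherwise median.
looped = df.copy()
for col in num_cols2:
    avg = looped[col].mean()
    med = looped[col].median()
    mod = looped[col].mode().iat[0]
    fill = avg if (avg == med and med == mod) else med
    looped[col] = looped[col].fillna(fill)

# Vectorised version: each column is filled with its own median.
vectorised = df.copy()
vectorised[num_cols2] = vectorised[num_cols2].fillna(vectorised[num_cols2].median())

print(looped.equals(vectorised))
```

Note that fillna is given df[num_cols2].median(), a Series indexed by column name, so each column receives its own median. Passing df.median() to a single column's fillna, as in the question, would instead align on the row index and is unlikely to fill anything as intended.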