index out of range error for two pandas series that should be basically the same

Time:01-29

a=cosmos.isna().sum()
c=len(cosmos)
a=a/c*100
for i in range(len(a)):
    if a[i]>80:
        cosmos.drop(columns=cosmos.columns[i], axis=1, inplace=True)

I get an index-out-of-bounds error, even though a and cosmos.columns should have the same length. I am trying to drop some columns, but it raises IndexError: index 7 is out of bounds for axis 0 with size 6. I explicitly passed axis=1, so I don't know what this has to do with axis 0.

I have no idea what to do. I just want to drop all columns with more than 80 percent empty rows. I could do it one by one this time. I tried running everything again, but it didn't help.

CodePudding user response:

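The error is most likely caused by dropping columns in place while looping. Each drop leaves cosmos with fewer columns, yet cosmos.columns[i] still uses a position i computed from the original frame, so i eventually runs past the end of the shrunken column index. That is also why the message mentions axis 0: it comes from indexing cosmos.columns, which is one-dimensional, not from the axis=1 argument to drop. Even before the error appears, the shifted positions mean the loop may drop the wrong columns. As a rule of thumb, you should avoid modifying a dataframe (or any sequence in general) while iterating over that same object.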
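One way to follow that rule of thumb while keeping an explicit loop is to collect the column labels first and drop them in a single call afterwards. The following is only a minimal sketch: the frame df and the names nan_pct and to_drop are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame, not the asker's data
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
    "d": [5, 6, 7, 8],
})

# Percentage of missing values per column, indexed by column label
nan_pct = df.isna().mean() * 100

# Collect the labels first, while the frame is still unchanged ...
to_drop = [col for col in df.columns if nan_pct[col] > 80]

# ... then drop them all in one call, so no position ever shifts
df = df.drop(columns=to_drop)

print(to_drop)           # ['b', 'c']
print(list(df.columns))  # ['a', 'd']
```

Because the lookup uses labels rather than positions, it does not matter how many columns are removed. With the in-place loop from the question, removing "b" would shift "d" into position 2, so the next drop by position would hit "d" instead of "c".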

That aside, there are more idiomatic pandas solutions that select (or drop) the relevant columns in one operation, which avoids iterating altogether. Here is one:

import numpy as np
import pandas as pd

# Sample data
cosmos = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "b": [np.nan, 3, 4, 7, 5, 3, 2, np.nan, 1, 2],
    "c": [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, np.nan],
    "d": [np.nan] * 10
})

# Use .mean instead of .sum, which avoids the `/ len(df)` step
nan_pct = cosmos.isna().mean()

cosmos = cosmos.loc[:, nan_pct <= 0.8]

This uses a boolean mask to keep only those columns in which at most 80% of the values are NaN.
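To make the mask concrete, here is a small check on the same sample data. It also shows what should be an equivalent formulation that drops by label instead of selecting, which may read closer to the original intent. The names mask, kept and dropped are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same sample data as above
cosmos = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "b": [np.nan, 3, 4, 7, 5, 3, 2, np.nan, 1, 2],
    "c": [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, np.nan],
    "d": [np.nan] * 10
})

nan_pct = cosmos.isna().mean()  # fraction of NaN per column: a=0.0, b=0.2, c=0.9, d=1.0
mask = nan_pct <= 0.8           # True for the columns to keep

# Select the columns to keep ...
kept = cosmos.loc[:, mask]

# ... or drop the others by label; both leave the same columns
dropped = cosmos.drop(columns=nan_pct[nan_pct > 0.8].index)

print(mask.tolist())          # [True, True, False, False]
print(list(kept.columns))     # ['a', 'b']
print(list(dropped.columns))  # ['a', 'b']
```

Neither variant modifies cosmos in place, so assign the result back (cosmos = ...) if the original frame is no longer needed.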
