Optimization of dropping rows in Pandas


I have a function that accomplishes the following requirements:

  • For each row, determine whether each column value is above the median of its column
  • If 50% or more of the columns in the row are above their column medians, drop the row

I have tried to accomplish this with the following function:

def remove_rows(df):
    col_count = len(df.columns)
    remove_rows = []

    for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
        count = 0
        for col in df.columns:
            if row[col] >= df[col].median():
                count += 1
            if count >= 0.5 * col_count:
                remove_rows.append(idx)
                break

    return df[~df.index.isin(remove_rows)]

However, this code is incredibly slow.

Are there other built-in pandas methods I can use to speed up this analysis? The example below takes roughly 50 minutes to process, and my actual data has shape (1039221, 84).

MRE below:

from sklearn.datasets import make_blobs
import pandas as pd
from tqdm import tqdm

def remove_rows(df):
    col_count = len(df.columns)
    remove_rows = []

    for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
        count = 0
        for col in df.columns:
            if row[col] >= df[col].median():
                count += 1
            if count >= 0.5 * col_count:
                remove_rows.append(idx)
                break

    return df[~df.index.isin(remove_rows)]

data = pd.DataFrame(data=make_blobs(n_samples=100000, n_features=10, random_state=8)[0], columns=["feat_{}".format(i) for i in range(10)])

new = remove_rows(data)

CodePudding user response:

Compare the rows with the column medians, take the per-row fraction of columns at or above the median with mean(axis=1), then filter with boolean indexing:

data[data.ge(data.median()).mean(axis=1) < 0.5]
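To see how this works, here is a minimal sketch on a tiny hand-made frame (the values are illustrative, not from the question's data): data.median() gives per-column medians, .ge() broadcasts the comparison across rows, and .mean(axis=1) turns the boolean matrix into the fraction of columns at or above the median for each row.

```python
import pandas as pd

# Small illustrative frame (hand-picked values, not the question's data)
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Column medians: a -> 2.5, b -> 25.0
# Fraction of columns at or above their median, per row
frac = df.ge(df.median()).mean(axis=1)

# Keep only rows where fewer than half the columns are >= the median
kept = df[frac < 0.5]
print(kept)  # rows 0 and 1 survive; rows 2 and 3 are dropped
```

Because the medians are computed once (instead of once per cell, as in the original loop) and the comparison is vectorized, this scales to the (1039221, 84) frame without a Python-level loop.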

CodePudding user response:

I got a small improvement using numba, but it is only about 20% faster, and the code is much less elegant than @Quang Hoang's.

import numba as nb
import numpy as np

@nb.jit(nopython=True, parallel=True)
def column_medians(values):
    # Compute each column's median in parallel; nopython mode needs
    # a plain ndarray, so the DataFrame is unwrapped by the caller
    medians = np.empty(values.shape[1])
    for idx in nb.prange(values.shape[1]):
        medians[idx] = np.median(values[:, idx])
    return medians

def remove_rows(df):
    values = df.to_numpy()
    medians = column_medians(values)
    is_above_median = values >= medians
    return df[np.sum(is_above_median, axis=-1) < 0.5 * len(df.columns)]
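For comparison, the same row-dropping logic in plain NumPy, without the numba dependency, is a sketch like the following (the small frame is hypothetical, chosen only to make the medians easy to check by hand):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

values = df.to_numpy()
medians = np.median(values, axis=0)   # per-column medians: [2.5, 25.0]
above = values >= medians             # boolean matrix, broadcast row-wise

# Drop rows where at least half the columns are >= their column median
kept = df[above.sum(axis=1) < 0.5 * values.shape[1]]
print(kept)  # keeps rows 0 and 1
```

The numba version mainly parallelizes the median computation; for median-sized frames the single call to np.median(values, axis=0) is usually fast enough on its own.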