I have a function that accomplishes the following requirements:
- For each row, ascertain whether or not the column values for the row are above the median value for the respective columns
- If more 50% or more of the columns in the row are above the median, drop the row
I have tried to accomplish this with the following function:
def remove_rows(df):
col_count = len(df.columns)
remove_rows = []
for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
count = 0
for col in df.columns:
if int(count) >= int(0.5*col_count):
remove_rows.append(idx)
elif row[col] >= df[col].median():
count =1
else:
continue
return df[~df.index.isin(remove_rows)]
However, this code is incredibly slow to process.
Are there other built-in pandas
methods I can use to speed up this analysis? The example takes nearly ~50 minutes to process. My actual data has shape (1039221, 84)
MRE below:
from sklearn.datasets import make_blobs
import pandas as pd
from tqdm import tqdm
def remove_rows(df):
col_count = len(df.columns)
remove_rows = []
for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
count = 0
for col in df.columns:
if int(count) >= int(0.5*col_count):
remove_rows.append(idx)
elif row[col] >= df[col].median():
count =1
else:
continue
return df[~df.index.isin(remove_rows)]
data = pd.DataFrame(data=make_blobs(n_samples=100000, n_features=10, random_state=8)[0], columns=["feat_{}".format(i) for i in range(10)])
new = remove_real(data)
CodePudding user response:
Compare the rows with the median, count the percentage with mean(axis=1)
, then you can filter with boolean indexing:
data[data.ge(data.median()).mean(axis=1) < 0.5]
CodePudding user response:
I did some small improvement using numba
, but it is just 20% faster while code is much less elegant than @Quang Hoang's.
import numba as nb
nb.jit(nopython=True, parallel=True)
def remove_rows(df):
df_values = df.values
df_values_T = df_values.T
medians = np.zeros(len(df_values[0]))
for idx in nb.prange(len(df_values[0])):
medians[idx] = np.median(df_values.T[idx])
medians = np.array(medians)
is_above_median = df_values >= medians
return df[np.sum(is_above_median, axis=-1) < 0.5 * len(df.columns)]