With these lines of code I am trying to drop every row from a dataframe that doesn't contain the string '-->' in any of its column values.
try:
    for j in range(len(df)):
        flg = 0
        for i in df.columns:
            if df[i].astype(str).str.contains('-->').iloc[j]:
                flg = 1
        if flg == 0:
            df.drop(df.index[j], axis=0, inplace=True)
except:
    pass
This code works. The question is whether it can be written in a more optimized way, because it takes a long time when the dataframe has 20K or more rows.
CodePudding user response:
You could vectorize:
mask = df.astype(str).apply(lambda column : column.str.contains('-->')).any(axis=1)
df = df[mask]
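To see what the mask looks like, here is a minimal sketch on a made-up three-row frame (the column names and values are assumptions for illustration, not the asker's data). It passes regex=False so that '-->' is matched as a literal substring rather than as a regular expression:

```python
import pandas as pd

# Made-up sample data: rows 0 and 2 contain '-->' in some column, row 1 does not
df = pd.DataFrame({
    "a": ["x --> y", "plain", "foo"],
    "b": [1, 2, 3],
    "c": ["bar", "baz", "p --> q"],
})

# Cast every column to str, test each column for the literal substring,
# then flag the rows where at least one column matched
mask = df.astype(str).apply(
    lambda column: column.str.contains("-->", regex=False)
).any(axis=1)

# Boolean indexing keeps only the flagged rows
filtered = df[mask]
print(filtered)
```

Because the mask is computed once for the whole frame, no row is dropped while the frame is still being iterated.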
CodePudding user response:
Select only the object columns with DataFrame.select_dtypes and test whether they contain the substring -->; numeric and datetime columns are not tested. Then keep all rows where at least one value matched, using DataFrame.any in boolean indexing:
df = df[df.select_dtypes(object).apply(lambda x : x.str.contains('-->')).any(axis=1)]
Or convert all columns to strings. This should be slower, because every column is tested (including numeric columns, which cannot contain the substring -->):
df = df[df.astype(str).apply(lambda x : x.str.contains('-->')).any(axis=1)]
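The select_dtypes variant can be wrapped in a small function. This is only a sketch: keep_rows_with_marker is a hypothetical helper name, the sample data is made up, and the columns are created with dtype=object explicitly so that select_dtypes picks them up. It adds na=False so that missing values in object columns count as "no match":

```python
import pandas as pd

def keep_rows_with_marker(frame, marker="-->"):
    """Hypothetical helper: keep rows where any object column contains marker."""
    # Only object columns are tested; numeric and datetime columns are skipped
    text_cols = frame.select_dtypes(include="object")
    mask = text_cols.apply(
        # regex=False matches the marker literally; na=False treats missing values as non-matches
        lambda col: col.str.contains(marker, regex=False, na=False)
    ).any(axis=1)
    return frame[mask]

# Made-up sample data with a missing value in an object column
df = pd.DataFrame({
    "a": pd.Series(["x --> y", None, "foo"], dtype=object),
    "n": [1, 2, 3],
    "c": pd.Series(["bar", "baz", "p --> q"], dtype=object),
})

result = keep_rows_with_marker(df)
print(result)
```

Note that this version returns a filtered frame instead of modifying df in place, so the result has to be assigned back if the original name should be reused.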