I would like to remove rows in a Panda DF if certain words are included in the text. The data is a financial statement. If the column 'account name' includes the words ['total', 'Total', 'Result'] the entire row should be removed.
I have tried to make a function that can be applied to clean the data.
I have made following function and for loop to solve it.
def is_subtotal1(name):
return pd.notna(name) and 'total' in name
def is_subtotal2(name):
return pd.notna(name) and 'Total' in name
def is_subtotal3(name):
return pd.notna(name) and 'Result' in name
valid_accounts =[name for name in df['account_name'].unique() if not is_subtotal1(name)]
valid_accounts =[name for name in valid_accounts if not is_subtotal2(name)]
valid_accounts =[name for name in valid_accounts if not is_subtotal3(name)]
valid_accounts
CodePudding user response:
I assume you have a list of substring. With this list, you want to remove the row where the column account_name
contains one of these substring. As you show on your data, I believe your column account_name
is string type. You can directly filtered out the row that contains the substring by using vectorize operation with contains()
function provide by Pandas:
df[~df.account_name.str.contains('|'.join(["total","Total","Result"])))]
CodePudding user response:
I'm assuming you want to delete all those rows where 'total', 'Total', or 'Result' occur as part of the value in the account_name field. Your code currently doesn't do that. You should use a pattern-matching algorithm like the KMP algorithm for this.