Home > Software design >  Remove rows with specific words in panda DF
Remove rows with specific words in panda DF

Time:12-19

I would like to remove rows in a Panda DF if certain words are included in the text. The data is a financial statement. If the column 'account name' includes the words ['total', 'Total', 'Result'] the entire row should be removed.

I have tried to make a function that can be applied to clean the data.

I have made following function and for loop to solve it.

def is_subtotal1(name):
    return pd.notna(name) and 'total' in name
def is_subtotal2(name):
    return pd.notna(name) and 'Total' in name
def is_subtotal3(name):
    return pd.notna(name) and 'Result' in name

valid_accounts =[name for name in df['account_name'].unique() if not is_subtotal1(name)]
valid_accounts =[name for name in valid_accounts if not is_subtotal2(name)]
valid_accounts =[name for name in valid_accounts if not is_subtotal3(name)]

valid_accounts

Screen print of df df

CodePudding user response:

I assume you have a list of substring. With this list, you want to remove the row where the column account_name contains one of these substring. As you show on your data, I believe your column account_name is string type. You can directly filtered out the row that contains the substring by using vectorize operation with contains() function provide by Pandas:

df[~df.account_name.str.contains('|'.join(["total","Total","Result"])))]

CodePudding user response:

I'm assuming you want to delete all those rows where 'total', 'Total', or 'Result' occur as part of the value in the account_name field. Your code currently doesn't do that. You should use a pattern-matching algorithm like the KMP algorithm for this.

  • Related