How to use pandas to read a .csv file and skip rows based on keywords?-CodePudding

Right now, I am parsing my file by using skiprows, but the skiprows is unreliable because the data can change. I want to skip rows based on keywords such as "Ferrari, Apple, Baseball". How can I accomplish this? Could you please provide examples?

EDIT: If possible, another solution that could work better for me is to skip n rows in the beginning and then stop reading values in the columns after a BLANK entry is reached. Is this possible?

import pandas as pd
import pyodbc
  

df = pd.read_csv(r'C://mycsvfile.csv', skiprows=[3,108,109,110,111,112,114,115,116,118])  

    
"""
Step 2 Specify columns we want to import
"""
columns = ['Run Date','Action','Symbol','Security Description','Security Type','Quantity','Price ($)','Commission ($)','Fees ($)','Accrued Interest ($)','Amount ($)','Settlement Date']

    
df_data = df[columns]
records = df_data.values.tolist()
    
print(df)

CodePudding user response：

You can try parse every column and try to find keyword which you need and delete row where your keyword contains.

df = df[df["Run Date"].str.contains("Ferrari") == False]

Make it loop.

CodePudding user response：

There's a few ways to do it. Here's my solution.

Make all keywords lower case to eliminate case sensitive
Define which columns you need to check for the keywords (I could alter this to check all columns if needed)
Concatenate the columns to check all columns at once as opposed to iterating through each
Make the cells all lower case (see 1)
Keep rows that do not contain a keyword

Code:

import pandas as pd

keywords = ['Ferrari', 'Apple', 'Baseball']
keywords = '|'.join(keywords)
keywords = keywords.lower()

df = pd.DataFrame([['I love apples.', '', 1, 'Jan 1, 2021'],
                   ['Apple is tasty.', 'Ferrari', 2, 'Jan 2, 2022'],
                   ['This does not contain a keyword', 'Nor does this.', 15, 'Mar 1, 2021'],
                   ['This row is ok', 'But it has baseball in it.', 34, 'Feb 1, 2021']], columns = ['A','B','Value','Date'])

columns_to_check = ['A','B', 'Value']
df = df[~df[columns_to_check].astype(str).sum(1).str.lower().str.contains(keywords)]

Input:

print(df.to_string())
                                 A                           B  Value         Date
0                   I love apples.                                  1  Jan 1, 2021
1                  Apple is tasty.                     Ferrari      2  Jan 2, 2022
2  This does not contain a keyword              Nor does this.     15  Mar 1, 2021
3                   This row is ok  But it has baseball in it.     34  Feb 1, 2021

Output:

print(df.to_string())
                                 A               B  Value         Date
2  This does not contain a keyword  Nor does this.     15  Mar 1, 2021