Right now, I am parsing my file by using skiprows, but the skiprows is unreliable because the data can change. I want to skip rows based on keywords such as "Ferrari, Apple, Baseball". How can I accomplish this? Could you please provide examples?
EDIT: If possible, another solution that could work better for me is to skip n rows in the beginning and then stop reading values in the columns after a BLANK entry is reached. Is this possible?
import pandas as pd
import pyodbc
df = pd.read_csv(r'C://mycsvfile.csv', skiprows=[3,108,109,110,111,112,114,115,116,118])
"""
Step 2 Specify columns we want to import
"""
columns = ['Run Date','Action','Symbol','Security Description','Security Type','Quantity','Price ($)','Commission ($)','Fees ($)','Accrued Interest ($)','Amount ($)','Settlement Date']
df_data = df[columns]
records = df_data.values.tolist()
print(df)
CodePudding user response:
You can try parse every column and try to find keyword which you need and delete row where your keyword contains.
df = df[df["Run Date"].str.contains("Ferrari") == False]
Make it loop.
CodePudding user response:
There's a few ways to do it. Here's my solution.
- Make all keywords lower case to eliminate case sensitive
- Define which columns you need to check for the keywords (I could alter this to check all columns if needed)
- Concatenate the columns to check all columns at once as opposed to iterating through each
- Make the cells all lower case (see 1)
- Keep rows that do not contain a keyword
Code:
import pandas as pd
keywords = ['Ferrari', 'Apple', 'Baseball']
keywords = '|'.join(keywords)
keywords = keywords.lower()
df = pd.DataFrame([['I love apples.', '', 1, 'Jan 1, 2021'],
['Apple is tasty.', 'Ferrari', 2, 'Jan 2, 2022'],
['This does not contain a keyword', 'Nor does this.', 15, 'Mar 1, 2021'],
['This row is ok', 'But it has baseball in it.', 34, 'Feb 1, 2021']], columns = ['A','B','Value','Date'])
columns_to_check = ['A','B', 'Value']
df = df[~df[columns_to_check].astype(str).sum(1).str.lower().str.contains(keywords)]
Input:
print(df.to_string())
A B Value Date
0 I love apples. 1 Jan 1, 2021
1 Apple is tasty. Ferrari 2 Jan 2, 2022
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021
3 This row is ok But it has baseball in it. 34 Feb 1, 2021
Output:
print(df.to_string())
A B Value Date
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021