I have a CSV file with the following structure (when viewed in Excel): unique section headers (strings) each occupy a single cell, followed by a block with column names and data.
This format repeats throughout the spreadsheet. Note that each section has different column names, a different number of columns, and a variable number of rows. Example:
Daily Statements
Date Desc Costumer ID Phone Status
12/21/21 aaa 1 123-123-1231 OK
12/21/21 aaa 2 333-123-1231 OK
12/21/21 bbb 3 222-123-1231 OK
12/21/21 bbb 3 444-123-1231 OK
<===== one empty row separates sections
Account History
Date Time Type Ref # Balance
12/21/21 1:00:00 BAL 456 $0.01
12/21/21 1:00:00 BAL 445 $0.01
12/21/21 1:00:00 BAL 645 $0.01
<===== one empty row separates sections
Order History
ID Date Ref #
1 12/21/21 777
2 12/21/21 888
3 12/21/21 999
4 12/21/21 9995
My goal is to extract only the rows in the Account History section:
Date Time Type Ref # Balance
12/21/21 1:00:00 BAL 456 $0.01
12/21/21 1:00:00 BAL 445 $0.01
12/21/21 1:00:00 BAL 645 $0.01
However, I'm unable to find an approach that works in Pandas, as I'd need to use the string "Account History" as an anchor to locate the rows of interest.
Does anyone know how this could be achieved?
CodePudding user response:
I don't see a straightforward way to do this with Pandas alone. Why not read the file first as plain text to find the rows of interest, and only then use Pandas to import just those rows?
import pandas as pd

with open(file, 'r') as f:
    # read until the line containing "Account History"
    for line_n, line_content in enumerate(f):
        if "Account History" in line_content:
            break
    start_row = line_n + 1  # the header row comes right after the title
    # keep reading (enumerate restarts at 0) until the next blank line
    for line_n, line_content in enumerate(f):
        if line_content == '\n':
            break
    tab_size = line_n - 1   # data rows between the header and the blank line

# import the dataframe, just from the target lines
df = pd.read_csv(file, skiprows=start_row, nrows=tab_size)
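If you end up needing the other sections too, you can generalize this idea and split the whole file into a dict of DataFrames keyed by section title. The sketch below is an assumption about your layout: a title row, a header row, then data rows, with sections separated by a blank row (Excel sometimes pads "empty" rows as `,,,,`, which this also treats as a separator). The function name `split_sections` and the comma delimiter are my own choices; adjust `sep` if your export is tab-delimited.

```python
import io
import pandas as pd

def split_sections(path, sep=","):
    """Split a file of blank-row-separated sections into DataFrames.

    Assumes each section is: a title row, a header row, then data rows.
    Rows that are empty, or Excel-padded like ",,,,", act as separators.
    """
    sections, block = {}, []

    def flush():
        if len(block) >= 2:  # need at least a title row and a header row
            title = block[0].split(sep)[0].strip()
            sections[title] = pd.read_csv(io.StringIO("".join(block[1:])), sep=sep)
        block.clear()

    with open(path) as f:
        for line in f:
            if line.strip().strip(sep):  # a non-blank row: part of the section
                block.append(line)
            else:                        # separator row: close the section
                flush()
    flush()                              # last section has no trailing blank row
    return sections
```

Then `split_sections("statements.csv")["Account History"]` gives you the target block, and the other sections come along for free.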