Home > other >  How to select rows in Pandas after a string header in a spreadsheet?
How to select rows in Pandas after a string header in a spreadsheet?

Time:12-29

I have an CSV file that has the following structure (when viewed on Excel), in which it has unique section headers (strings) occupying a single cell, followed by a block with column names and data.

This format repeats itself throughout the Excel spreadsheet. Note that each section has different col names and number, and variable number of rows. Example:

    Daily Statements                                    
    Date        Desc    Costumer ID   Phone          Status
    12/21/21    aaa     1             123-123-1231   OK 
    12/21/21    aaa     2             333-123-1231   OK
    12/21/21    bbb     3             222-123-1231   OK
    12/21/21    bbb     3             444-123-1231   OK
                                              <===== one empty row separates sections
    Account History                                 
    Date        Time    Type    Ref #   Balance
    12/21/21    1:00:00 BAL     456     $0.01 
    12/21/21    1:00:00 BAL     445     $0.01
    12/21/21    1:00:00 BAL     645     $0.01
                                              <===== one empty row separates sections
    Order History                                   
    ID    Date      Ref #
    1     12/21/21  777  
    2     12/21/21  888 
    3     12/21/21  999
    4     12/21/21  9995

My goal is to extract only the rows in Account History:

    Date        Time    Type    Ref #   Balance
    12/21/21    1:00:00 BAL     456     $0.01 
    12/21/21    1:00:00 BAL     445     $0.01
    12/21/21    1:00:00 BAL     645     $0.01

However I'm unable to find an approach that would work in Pandas, as I'd need to use the string "Account History" as an anchor to indicate the rows of interest.

Do you know how this could be achieved?

CodePudding user response:

I don't see a straightforward way to do that just with Pandas. Why not read the file first just as a text file to find the rows of interest, and only then, use Pandas to import just those rows?

with open(file, 'r') as f:
    # read until the line "Account History"
    for line_n, line_content in enumerate(f):
        if "Account History" in line_content:
            break
    start_row = line_n   1

    # continue reading, and find the following new line
    for line_n, line_content in enumerate(f):
        if line_content == '\n':
            break
    tab_size = line_n - 1

# import the dataframe, just from the target lines
df = pd.read_csv(file, skiprows=start_row, nrows=tab_size)
  • Related