Openpyxl, Pandas or both-CodePudding

I'm trying to process an excel file so that i can use each row and column for specific operations later on.

My problem is as follows:

Using Openpyxl made it easier for me to load the file and be able to iterate over the rows

    #reading the excel file
    path = r'Datasets/Chapter 1/Table B1.1.xlsx'
    wb = load_workbook(path) #loading the excel table
    ws = wb.active #grab the active worksheet
    
    #Setting the doc Header
    for h in ws.iter_rows(max_row = 1, values_only = True): #getting the first row (Headers) in the table
        header = list(h)
    
    for sh in ws.iter_rows(min_row = 1 ,max_row = 2, values_only = True):
        sub_header = list(sh)
    
    #removing all of the none Values
    header = list(filter(None, header))
    sub_header = list(filter(None, sub_header))
    #creating a list of all the rows in the excel file
    row_list = []
    
    for row in ws.iter_rows(min_row=3): #Iteration over every single row starting from the third row since first two are the headers
        row = [cell.value for cell in row] #Creating a list from each row
        row = list(filter(None, row)) #removing the none values from each row
        row_list.append(row) #creating a list of all rows (starting from the 3d one)

    colm = []
    for col in ws.iter_cols(min_row=3,min_col = 1): #Iteration over every single row starting from the third row since first two are the headers
        col = [cell.value for cell in col] #Creating a list from each row
        col = list(filter(None, col)) #removing the none values from each row
        colm.append(col) #creating a list of all rows (starting from the 3d one)

but at the same time (as far as I've read in the docs), I can't visualize it or do direct operations on the rows or columns.

While using pandas is more efficient to do direct operations on the rows and columns, I've read that iterating over a dataframe to get the rows in a list is not recommended even if it were to be done using df.iloc[2:] it would not give me the same result (saving each row in a specific list since the headers would always be there). However, unlike Openpyxl, doing direct operations on columns is much easier using something like df[col1]-df[col2] using the column name which is something I need to do. (Since just putting all columns values in a list wont do it for me)

So my question is whether or not there is a solution to be able to do what I want using only one of them, or if using both of them isn't that bad, keeping in mind I'd have to load the excel file twice.

"Thanks in Advance!"

CodePudding user response：

There is no problem to read an excel file once using openpyxl and then load rows to pandas:

pandas.DataFrame(row_list, columns=header)

You are right, iterating over a DataFrame using indexes is quite slow, but you have other options: apply(), iterrows(), itertuples()

Link: Different ways to iterate over rows in pandas DataFrame

I would also like to point out that your code probably does not do what you would like.

list(filter(None, header)) filters not only None, but all false-values such as 0 or "".
such filtering shifts the columns. for example, you have a row [1, None, 3] and columns ['a', 'b', 'c']. by filtering None, you will get [1, 3] which will relate to columns 'a' and 'b'.