I'm trying to process an Excel file so that I can use each row and column for specific operations later on. My problem is as follows:
- Using openpyxl made it easy for me to load the file and iterate over the rows:
from openpyxl import load_workbook

# reading the excel file
path = r'Datasets/Chapter 1/Table B1.1.xlsx'
wb = load_workbook(path)  # loading the workbook
ws = wb.active  # grab the active worksheet

# setting the doc headers
for h in ws.iter_rows(max_row=1, values_only=True):  # first row of the table (headers)
    header = list(h)
for sh in ws.iter_rows(min_row=2, max_row=2, values_only=True):  # second row (sub-headers)
    sub_header = list(sh)

# removing all of the None values
header = list(filter(None, header))
sub_header = list(filter(None, sub_header))

# creating a list of all the data rows in the file
row_list = []
for row in ws.iter_rows(min_row=3):  # data starts at the third row; the first two are headers
    row = [cell.value for cell in row]  # creating a list from each row
    row = list(filter(None, row))  # removing the None values from each row
    row_list.append(row)

# the same thing, but column by column
colm = []
for col in ws.iter_cols(min_row=3, min_col=1):
    col = [cell.value for cell in col]  # creating a list from each column
    col = list(filter(None, col))  # removing the None values from each column
    colm.append(col)
but at the same time (as far as I've read in the docs), I can't visualize the data or do direct operations on the rows or columns.
- While pandas is more efficient for direct operations on the rows and columns, I've read that iterating over a DataFrame to collect the rows in a list is not recommended, and even something like df.iloc[2:] would not give me the same result (saving each row in its own list, since the headers would always be there). However, unlike openpyxl, direct operations on columns are much easier, e.g. df[col1] - df[col2] using the column names, which is something I need to do (just putting all the column values in a list won't do it for me).
So my question is whether there is a way to do what I want using only one of them, or whether using both really isn't that bad, keeping in mind I'd have to load the Excel file twice.
Thanks in advance!
CodePudding user response:
There is no problem with reading the Excel file once using openpyxl and then loading the rows into pandas:
pandas.DataFrame(row_list, columns=header)
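Put together, a minimal sketch of that route (the column names 'col1' and 'col2' are hypothetical, just to show the by-name arithmetic you asked about):

from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(r'Datasets/Chapter 1/Table B1.1.xlsx')
ws = wb.active

header = [cell.value for cell in ws[1]]  # first row as column names
row_list = [[cell.value for cell in row] for row in ws.iter_rows(min_row=3)]

df = pd.DataFrame(row_list, columns=header)
diff = df['col1'] - df['col2']  # hypothetical column names

This way the file is loaded only once, and you still get pandas' column-wise operations.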
You are right, iterating over a DataFrame using indexes is quite slow, but you have other options: apply(), iterrows(), itertuples().
Link: Different ways to iterate over rows in pandas DataFrame
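If you do need each row as its own plain list, a short sketch using itertuples() (reusing the df built above):

rows = [list(t) for t in df.itertuples(index=False, name=None)]  # name=None yields plain tuples

Of the three options, itertuples() is usually the fastest for this kind of pass.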
I would also like to point out that your code probably does not do what you want. list(filter(None, header)) filters out not only None but all falsy values, such as 0 or "". Such filtering also shifts the columns: say you have a row [1, None, 3] and columns ['a', 'b', 'c']; by filtering None you get [1, 3], which will then line up with columns 'a' and 'b'.
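To avoid that shift, one alternative (a sketch, assuming ws and an unfiltered, full-width header list): keep every cell so values stay under their columns, and let pandas represent the gaps as NaN.

# keep all cells, including empty ones, so columns stay aligned
row_list = [[cell.value for cell in row] for row in ws.iter_rows(min_row=3)]
df = pd.DataFrame(row_list, columns=header)

# drop only rows/columns that are entirely empty, if that's what you want
df = df.dropna(axis=0, how='all').dropna(axis=1, how='all')

# and if you really only want to strip None (keeping 0 and ""), test for it explicitly:
header = [v for v in header if v is not None]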