I am looking to compare two Excel workbooks that should be identical to identify where there are differences between the content.
The below code I found here works great, but I have some workbooks that have many varying numbers of sheets (some will have one sheet, others will have 70 sheets across the two workbooks). Is there a way to iterate through all of the dataframes/sheets in the workbook (e.g. a range of indices) without having to hard code the index numbers? Thanks!
In block 1
sheet1 = rb1.sheet_by_index(0)
Then in block 2
sheet1 = rb1.sheet_by_index(1)
Then in block 3
sheet1 = rb1.sheet_by_index(2)
from itertools import izip_longest
import xlrd
rb1 = xlrd.open_workbook('file1.xlsx')
rb2 = xlrd.open_workbook('file2.xlsx')
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)
for rownum in range(max(sheet1.nrows, sheet2.nrows)):
if rownum < sheet1.nrows:
row_rb1 = sheet1.row_values(rownum)
row_rb2 = sheet2.row_values(rownum)
for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
if c1 != c2:
print "Row {} Col {} - {} != {}".format(rownum 1, colnum 1, c1, c2)
else:
print "Row {} missing".format(rownum 1)```
CodePudding user response:
You can make use of "openpyxl" library to get the sheetnames for an excel.
from openpyxl import load_workbook
excel = load_workbook(filepath, read_only=True,keep_links=False)
sheet_names = excel.sheetnames
print(sheet_names) # prints the sheetnames of the excel in form of a list
You can iterate over the "sheet_names" variable
for sheet in sheet_names:
df = pd.read_excel(filepath,sheet,engine='openpyxl')