- "xls" is the variable representing the excel file
- "main" is a list of worksheets in the workbook "xls" to concatenate data from
- All columns are unmerged and borders are just for printing aesthetic reasons
Three sheets are imported correctly then the format problem occurs. Three sheets are then imported using this improper format. Then the problem occurs again by shifting the data in a similar way. So basically every forth sheet that is imported appears to pull the data from columns out of order.
The problem occurs when it moves to the next sheet, even though it's column formatting is the same as the last.
It appears to pull M:P correctly, then it jumbles the data by appearing to pull in this order: AC:AD, S:Z wile adding five extra blank columns, Q:R, AB:AC.
The only difference in the two worksheets is that the first has data in more columns than the second however, both have the save number of columns being queried.
df1 = [pd.read_excel(xls, sheet_name=s, skiprows=4, nrows=32, usecols='M:AD') for s in main]
dfconcat = pd.concat(df1, ignore_index=True, sort=False)
dfconcat.dropna(axis=0, how='all', inplace=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
dfconcat.to_excel(writer, 'bananas', index=False, header=False, na_rep='', merge_cells=False)
writer.save()
Since it occurs every fourth sheet, I assume there is something incorrect in my code, or something to add to it to reset something in pandas after every pass. Any guidance would be appreciated.
CodePudding user response:
Add header=None
at the end inside pd.read_excel
. By default, read_excel
will use the first row (header=0
) as the header. I.e. in your case, in view of skiprows=4
, ROW 5:5
in each sheet will be interpreted as the header.
This causes problems, when you use pd.concat
. E.g. if you have pd.concat([d1,d2])
and d1
has cols A, B
, but d2
has cols B, A
, then the result will actually have order A, B
, following the first df. Hence, the "shift" of the columns.
So, basically, you end up doing something like this:
dfs = [pd.DataFrame({'a':[1],'b':[2]}),
pd.DataFrame({'b':[1],'a':[2]})]
print(pd.concat(dfs, ignore_index=True, sort=False))
a b
0 1 2
1 2 1
While you actually want to do:
dfs = [pd.DataFrame([{0: 'a', 1: 'b'}, {0: 1, 1: 2}]),
pd.DataFrame([{0: 'b', 1: 'a'}, {0: 1, 1: 2}])]
print(pd.concat(dfs, ignore_index=True, sort=False))
0 1
0 a b
1 1 2
2 b a
3 1 2