Pandas importing column data in incorrect order into data frame-CodePudding

"xls" is the variable representing the excel file
"main" is a list of worksheets in the workbook "xls" to concatenate data from
All columns are unmerged and borders are just for printing aesthetic reasons

Three sheets are imported correctly then the format problem occurs. Three sheets are then imported using this improper format. Then the problem occurs again by shifting the data in a similar way. So basically every forth sheet that is imported appears to pull the data from columns out of order.

Original data:

Output returns as expected:

The problem occurs when it moves to the next sheet, even though it's column formatting is the same as the last.

Original data:

Output returned:

It appears to pull M:P correctly, then it jumbles the data by appearing to pull in this order: AC:AD, S:Z wile adding five extra blank columns, Q:R, AB:AC.

The only difference in the two worksheets is that the first has data in more columns than the second however, both have the save number of columns being queried.

df1 = [pd.read_excel(xls, sheet_name=s, skiprows=4, nrows=32, usecols='M:AD') for s in main]
dfconcat = pd.concat(df1, ignore_index=True, sort=False)
dfconcat.dropna(axis=0, how='all', inplace=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
dfconcat.to_excel(writer, 'bananas', index=False, header=False, na_rep='', merge_cells=False)
writer.save()

Since it occurs every fourth sheet, I assume there is something incorrect in my code, or something to add to it to reset something in pandas after every pass. Any guidance would be appreciated.

CodePudding user response：

Add header=None at the end inside pd.read_excel. By default, read_excel will use the first row (header=0) as the header. I.e. in your case, in view of skiprows=4, ROW 5:5 in each sheet will be interpreted as the header.

This causes problems, when you use pd.concat. E.g. if you have pd.concat([d1,d2]) and d1 has cols A, B, but d2 has cols B, A, then the result will actually have order A, B, following the first df. Hence, the "shift" of the columns.

So, basically, you end up doing something like this:

dfs = [pd.DataFrame({'a':[1],'b':[2]}),
       pd.DataFrame({'b':[1],'a':[2]})]

print(pd.concat(dfs, ignore_index=True, sort=False))

   a  b
0  1  2
1  2  1

While you actually want to do:

dfs = [pd.DataFrame([{0: 'a', 1: 'b'}, {0: 1, 1: 2}]),
       pd.DataFrame([{0: 'b', 1: 'a'}, {0: 1, 1: 2}])]

print(pd.concat(dfs, ignore_index=True, sort=False))

   0  1
0  a  b
1  1  2
2  b  a
3  1  2