Home > Enterprise >  Pandas importing column data in incorrect order into data frame
Pandas importing column data in incorrect order into data frame

Time:09-30

  • "xls" is the variable representing the excel file
  • "main" is a list of worksheets in the workbook "xls" to concatenate data from
  • All columns are unmerged and borders are just for printing aesthetic reasons

Three sheets are imported correctly then the format problem occurs. Three sheets are then imported using this improper format. Then the problem occurs again by shifting the data in a similar way. So basically every forth sheet that is imported appears to pull the data from columns out of order.

Original data: enter image description here

Output returns as expected: enter image description here

The problem occurs when it moves to the next sheet, even though it's column formatting is the same as the last.

Original data: enter image description here

Output returned: enter image description here

It appears to pull M:P correctly, then it jumbles the data by appearing to pull in this order: AC:AD, S:Z wile adding five extra blank columns, Q:R, AB:AC.

The only difference in the two worksheets is that the first has data in more columns than the second however, both have the save number of columns being queried.

df1 = [pd.read_excel(xls, sheet_name=s, skiprows=4, nrows=32, usecols='M:AD') for s in main]
dfconcat = pd.concat(df1, ignore_index=True, sort=False)
dfconcat.dropna(axis=0, how='all', inplace=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
dfconcat.to_excel(writer, 'bananas', index=False, header=False, na_rep='', merge_cells=False)
writer.save()

Since it occurs every fourth sheet, I assume there is something incorrect in my code, or something to add to it to reset something in pandas after every pass. Any guidance would be appreciated.

CodePudding user response:

Add header=None at the end inside pd.read_excel. By default, read_excel will use the first row (header=0) as the header. I.e. in your case, in view of skiprows=4, ROW 5:5 in each sheet will be interpreted as the header.

This causes problems, when you use pd.concat. E.g. if you have pd.concat([d1,d2]) and d1 has cols A, B, but d2 has cols B, A, then the result will actually have order A, B, following the first df. Hence, the "shift" of the columns.

So, basically, you end up doing something like this:

dfs = [pd.DataFrame({'a':[1],'b':[2]}),
       pd.DataFrame({'b':[1],'a':[2]})]

print(pd.concat(dfs, ignore_index=True, sort=False))

   a  b
0  1  2
1  2  1

While you actually want to do:

dfs = [pd.DataFrame([{0: 'a', 1: 'b'}, {0: 1, 1: 2}]),
       pd.DataFrame([{0: 'b', 1: 'a'}, {0: 1, 1: 2}])]

print(pd.concat(dfs, ignore_index=True, sort=False))

   0  1
0  a  b
1  1  2
2  b  a
3  1  2
  • Related