Home > Back-end >  Pandas read_excel wrong output
Pandas read_excel wrong output

Time:09-22

Update: so I'm messing around with the two files, I copied the data from DOB column from the 2nd file to the 1st file to make the files look visually identical. However I notice some really interesting behaviour when using Ctrl F in Microsoft Excel. When I leave the search box blank in the first file it finds no matches. However when I repeat the same operation with the 2nd file it finds 21 matches from each cell between E1 to G7. I suppose somehow there are some blank/invisible cells in the 2nd file - and that's what's causing read_excel to behave differently.

My goal is to simply execute the pandas read_excel function. However, I'm running into a very strange situation in which I am trying to run pandas.read_excel on two very similar excel files but am getting substantially different results.

Code

import pandas

data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)

Output

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
                             Name  DOB  Class Year  GPA
Redeye438 2011-09-20 Fresh      1  NaN         NaN  NaN
Redeye439 2010-09-22 Soph       2  NaN         NaN  NaN
Redeye440 2009-09-20 Junior     3  NaN         NaN  NaN
Redeye441 2008-09-22 Senior     4  NaN         NaN  NaN
Redeye442 2011-09-22 Fresh      4  NaN         NaN  NaN
Redeye443 2010-09-20 Soph       3  NaN         NaN  NaN

Why are the columns mapped incorrectly for data2?

The excel files in question (the only difference is the data in the DOB column): Screenshot of test_1.xlsx Screenshot of test_2.xlsx

CodePudding user response:

It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.

import pandas as pd 

pd.show_versions()

INSTALLED VERSIONS
------------------
python           : 3.10.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
pandas           : 1.5.0

As you can see below, the columns of data2 are mapped correctly.

import pandas as pd

data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
        Name        DOB Class Year  GPA
0  Redeye438 2011-09-20      Fresh    1
1  Redeye439 2010-09-22       Soph    2
2  Redeye440 2009-09-20     Junior    3
3  Redeye441 2008-09-22     Senior    4
4  Redeye442 2011-09-22      Fresh    4
5  Redeye443 2010-09-20       Soph    3

However, you're right because the format of the two Excel files are not the same. In fact, when looking at test2.xlsx more carefully, this one seems to carry 7x3 blank cells (rows/cols). The latest version of pandas seems to handle this kind of Excel file since the empty cells are ignored when calling enter image description here

  • Related