I have an extremely simple .xlsx file and pandas is not reading the first row completely. It's very strange since it only reads one of the columns, and the others are blank. After A LOT of trial and error, it seems there's something hidden in the Excel file itself, since if I remove completely the row, and I just type it all again, then it works.
However, there's nothing visual that I can see. If I export the file to .csv then pandas works as well.
I'm using python 3.7 with pandas 1.1.5. I tried upgrading pandas but I can't, pip tells me I'm using the latest available version, even though I see that pandas 1.3 is available. Not sure if this is already fixed in a new version, and if it is, how do I get it installed (I'm using the app both on Mac and on Windows via Anaconda).
The xlsx file showing the problem is here:
It just contains the header and a row. That's it.
The script to read it is this:
import pandas as pd
print(f"pandas version is {pd.__version__}")
df = pd.read_excel('Book1.xlsx', dtype=str)
df = df.fillna('')
print(f"columns are {df.columns.tolist()}")
print(df)
And the output is this:
anibal@~/PycharmProjects/CIUSSS$ python3 test.py
pandas version is 1.1.5
columns are ['Source']
Source
SNOMED CT 115161005 Genus Abiotrophia (organism) Abiotrophia Genus Abiotrophia
Where it should be:
anibal@~/PycharmProjects/CIUSSS$ python3 test.py
pandas version is 1.1.5
columns are ['Source', 'f2', 'f3', 'f4', 'f5']
Source f2 f3 f4 f5
0 SNOMED CT 115161005 Genus Abiotrophia (organism) Abiotrophia Genus Abiotrophia
Can somebody please tell me if there's something different that I should be doing in the API to be able to read this? Or if I need to have a newer version of pandas, how do I get a newer version with pip (and then in anaconda)?
CodePudding user response:
Update: the issue was indeed the version. I tried the exact same file with python 3.9.9 and pandas 1.3.4 and everything looks good.