When I read this file with pandas, only 15 of the 44 rows are read, seemingly without a system. Cell E31 is set to nan
, and I'm unable to find out why.
How to reproduce:
% pip show pandas
Name: pandas
Version: 1.5.1
(...)
% python3
Python 3.10.7 (main, Sep 15 2022, 01:51:29) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> test = pandas.read_excel("Downloads/invalid_file.xlsx")
>>> test
name super object ... comment_rm gui_element gui_attributes
0 hasSignature hasValue TextValue ... NaN SimpleText NaN
1 hasCreator hasValue TextValue ... NaN SimpleText NaN
2 hasProvenience hasValue TextValue ... NaN SimpleText NaN
3 hasDate hasValue TextValue ... NaN SimpleText NaN
4 hasLanguage hasValue TextValue ... NaN SimpleText NaN
5 hasCopyright hasValue TextValue ... NaN SimpleText NaN
6 hasComment hasValue TextValue ... NaN Textarea NaN
7 hasDescription hasValue TextValue ... NaN Richtext NaN
8 hasLatinName hasValue, dcterms:title TextValue ... NaN SimpleText NaN
9 hasSynonym hasValue Synonyme ... NaN SimpleText NaN
10 hasPagenum seqnum IntValue ... NaN SimpleText NaN
11 partOf isPartOf :Manuscript ... NaN Searchbox NaN
12 hasTitle hasValue TextValue ... NaN SimpleText NaN
13 hasFolio hasValue TextValue ... NaN SimpleText NaN
14 hasPlant hasLinkTo :Plant ... NaN Searchbox NaN
[15 rows x 15 columns]
>>> test.iloc[9,4]
nan
I carefully checked the docs if I have to set a certain parameter, but I found nothing.
CodePudding user response:
Thanks to Сергей Кох's answer, I was able to find the problem: Without my knowledge, there are several hidden sheets in my Excel file. Only one is visible (the one with the 44 rows), but it is not in the first position. The one in the first position is a hidden one that happens to be very similar to the visible one, except that it only contains 15 rows.
By default, pandas reads the first sheet, which in my case led to a difficult-to-debug constellation.
(I find it strange that such a file exists, but I have a suspicion about who could be the culprit: The file was on Google Drive, and was perhaps opened with Google Sheets.)
I will program my library so that it opens the only visible sheet, instead of the first one. Perhaps I can do it as described here. Until now, I had thought that if I write in the docs of my library that I expect an Excel file with one sheet, then I could just naively read the first sheet.
CodePudding user response:
I saved the invalid .xlsx file as csv, clicked OK on all excel warnings and read it pandas.read_csv. All lines are read.I opened the file directly with the openpyxl library, it also read completely. The only way that worked was the second way to load excel files in pandas via file object
import pandas as pd
test = pd.read_excel(open('invalid_file.xlsx', 'rb'), sheet_name='Sheet1')
print(test.to_string())
Moreover, if you remove sheet_name='Sheet1', then this pseudo-random enumeration of columns with incomplete loading appears.