Home > Software design >  pandas.read_excel() randomly skips rows, sets cells to nan
pandas.read_excel() randomly skips rows, sets cells to nan

Time:11-05

When I read this file with pandas, only 15 of the 44 rows are read, seemingly without a system. Cell E31 is set to nan, and I'm unable to find out why.

How to reproduce:

% pip show pandas
Name: pandas
Version: 1.5.1
(...)
% python3
Python 3.10.7 (main, Sep 15 2022, 01:51:29) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> test = pandas.read_excel("Downloads/invalid_file.xlsx")
>>> test
              name                    super       object  ... comment_rm gui_element  gui_attributes
0     hasSignature                 hasValue    TextValue  ...        NaN  SimpleText             NaN
1       hasCreator                 hasValue    TextValue  ...        NaN  SimpleText             NaN
2   hasProvenience                 hasValue    TextValue  ...        NaN  SimpleText             NaN
3          hasDate                 hasValue    TextValue  ...        NaN  SimpleText             NaN
4      hasLanguage                 hasValue    TextValue  ...        NaN  SimpleText             NaN
5     hasCopyright                 hasValue    TextValue  ...        NaN  SimpleText             NaN
6       hasComment                 hasValue    TextValue  ...        NaN    Textarea             NaN
7   hasDescription                 hasValue    TextValue  ...        NaN    Richtext             NaN
8     hasLatinName  hasValue, dcterms:title    TextValue  ...        NaN  SimpleText             NaN
9       hasSynonym                 hasValue     Synonyme  ...        NaN  SimpleText             NaN
10      hasPagenum                   seqnum     IntValue  ...        NaN  SimpleText             NaN
11          partOf                 isPartOf  :Manuscript  ...        NaN   Searchbox             NaN
12        hasTitle                 hasValue    TextValue  ...        NaN  SimpleText             NaN
13        hasFolio                 hasValue    TextValue  ...        NaN  SimpleText             NaN
14        hasPlant                hasLinkTo       :Plant  ...        NaN   Searchbox             NaN

[15 rows x 15 columns]
>>> test.iloc[9,4]
nan

I carefully checked the docs if I have to set a certain parameter, but I found nothing.

CodePudding user response:

Thanks to Сергей Кох's answer, I was able to find the problem: Without my knowledge, there are several hidden sheets in my Excel file. Only one is visible (the one with the 44 rows), but it is not in the first position. The one in the first position is a hidden one that happens to be very similar to the visible one, except that it only contains 15 rows.

By default, pandas reads the first sheet, which in my case led to a difficult-to-debug constellation.

(I find it strange that such a file exists, but I have a suspicion about who could be the culprit: The file was on Google Drive, and was perhaps opened with Google Sheets.)

I will program my library so that it opens the only visible sheet, instead of the first one. Perhaps I can do it as described here. Until now, I had thought that if I write in the docs of my library that I expect an Excel file with one sheet, then I could just naively read the first sheet.

CodePudding user response:

I saved the invalid .xlsx file as csv, clicked OK on all excel warnings and read it pandas.read_csv. All lines are read.I opened the file directly with the openpyxl library, it also read completely. The only way that worked was the second way to load excel files in pandas via file object

import pandas as pd

test = pd.read_excel(open('invalid_file.xlsx', 'rb'), sheet_name='Sheet1')
print(test.to_string())

Moreover, if you remove sheet_name='Sheet1', then this pseudo-random enumeration of columns with incomplete loading appears.

  • Related