With CSV files I sometimes use the nrows= parameter for debugging and to "speed up" reading the file. This time the file is an XLSX file.
I tested the same parameter with pandas.read_excel() on an Excel file with over 400k rows. Reading that file takes roughly 3 minutes and 20 seconds, no matter whether I pass nrows=10 or omit nrows entirely.
With nrows=10 the result is, of course, only 10 rows.
I assume this is because of the Excel file format, where it is not possible to physically skip/ignore rows while reading?
CodePudding user response:
Parsing an XLSX file involves opening a ZIP (OOXML documents are zips of XML files), parsing some XML to find out what sheets there are, then parsing the particular sheet's XML and interpreting it to work out the contents of each cell, etc.
That's not quite as straightforward as opening a text file and only reading ten lines.
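If you want to see the container structure for yourself, here is a minimal sketch that uses only the standard library's zipfile module. The function name list_xlsx_parts is made up for illustration, and the exact member names depend on the program that wrote the workbook.

```python
import zipfile


def list_xlsx_parts(path):
    """Return (member name, uncompressed size in bytes) for each file in the XLSX."""
    # An XLSX workbook is an ordinary ZIP archive, so zipfile can open it directly.
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

In a typical workbook the sheet data sits in members such as xl/worksheets/sheet1.xml, and for a large sheet that one XML member tends to account for most of the uncompressed size.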
I might recommend reading the XLS(X) file once into a dataframe, and then e.g. pickling that dataframe for subsequent use. If you're feeling fancy, you could write a function that invisibly does that for you (tries to look for a "cached" pickled version of your document(s)).
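A rough sketch of such a caching helper could look like the following. The name read_excel_cached, the reader argument, and the ".pkl" suffix next to the source file are all assumptions made for illustration; by default it falls back to pandas.read_excel.

```python
import pickle
from pathlib import Path


def read_excel_cached(path, reader=None, **kwargs):
    """Load a spreadsheet, reusing a pickled copy while it is still fresh."""
    source = Path(path)
    # Hypothetical naming scheme: "book.xlsx" is cached as "book.xlsx.pkl".
    cache = source.with_name(source.name + ".pkl")

    # Reuse the cache only if it is at least as new as the source file.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        with cache.open("rb") as fh:
            return pickle.load(fh)

    if reader is None:
        import pandas as pd  # imported lazily; only needed on a cache miss

        reader = pd.read_excel

    # Slow path: parse the real file once, then store the result for next time.
    data = reader(source, **kwargs)
    with cache.open("wb") as fh:
        pickle.dump(data, fh)
    return data
```

The first call pays the full parsing cost; later calls only unpickle the result until the source file is modified. Note that this sketch does not include the keyword arguments in the cache key, so calling it with a different sheet_name or nrows would return the previously cached result. Also, only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.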