I have a large .xlsx file with 1 million rows. I don't want to open the whole file in one go. I was wondering if I can read a chunk of the file, process it and then read the next chunk? (I prefer to use pandas for it.)
import pandas as pd
# File size: ~50 MB
data = pd.read_excel("m.xlsx")
# Loading the whole file at once takes a long time
CodePudding user response:
Maybe you are looking for:
import pandas as pd
data = pd.read_excel("m.xlsx", nrows=50)  # nrows limits how many rows are parsed
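If you want a specific slice of rows while still keeping the header, skiprows also accepts an iterable, so you can skip data rows without skipping row 0 (the header). A minimal sketch, assuming a small demo file (the file and column names here are just for illustration):

```python
import pandas as pd

# Demo file: a header row plus 100 data rows
pd.DataFrame({"Num": range(1, 101)}).to_excel("demo.xlsx", index=False)

# Read data rows 51-100 only, keeping the header: skip sheet rows 1..50
# but not row 0 (the header row)
chunk = pd.read_excel("demo.xlsx", skiprows=range(1, 51), nrows=50)
print(chunk.iloc[0, 0], chunk.iloc[-1, 0])  # 51 100
```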
CodePudding user response:
Here is an option: read a fixed number of rows at a time and use skiprows as the offset for the next batch. For illustration, I used a file with 100 rows of data like this:
Num Factor
0 1 5
1 2 10
2 3 15
3 4 20
4 5 25
...
95 96 480
96 97 485
97 98 490
98 99 495
99 100 500
The following code reads 20 rows at a time from "Sheet1", skipping the header row, and prints the first and last value of the first column in each batch to show progress:
import pandas as pd

rows_to_read = 20
for offset in range(1, 101, rows_to_read):
    data = pd.read_excel(filename, sheet_name="Sheet1", header=None,
                         nrows=rows_to_read, skiprows=offset)
    print(data.iloc[0, 0], data.iloc[-1, 0])
Here is the output:
1 20
21 40
41 60
61 80
81 100
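The loop above hard-codes the total row count (101). A generalized sketch stops when a batch comes back empty, so the total number of rows need not be known in advance (the function name and demo file name here are my own, not from pandas):

```python
import pandas as pd

def read_excel_in_batches(filename, sheet_name, batch_size):
    """Yield successive DataFrames of at most batch_size data rows."""
    offset = 1  # start past the header row
    while True:
        batch = pd.read_excel(filename, sheet_name=sheet_name, header=None,
                              nrows=batch_size, skiprows=offset)
        if batch.empty:
            break  # no rows left
        yield batch
        offset += batch_size

# Demo: a sheet with a header plus 100 data rows, like the table above
pd.DataFrame({"Num": range(1, 101),
              "Factor": [5 * n for n in range(1, 101)]}
             ).to_excel("batch_demo.xlsx", index=False, sheet_name="Sheet1")

batches = list(read_excel_in_batches("batch_demo.xlsx", "Sheet1", 20))
print(len(batches))                                    # 5
print(batches[0].iloc[0, 0], batches[-1].iloc[-1, 0])  # 1 100
```

One caveat: as far as I know, each read_excel call still parses the workbook from the beginning, so this caps memory per batch but not total parsing time. For true row-by-row streaming of a large .xlsx, openpyxl's read-only mode may be worth a look.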