I have a large .xlsx file with 1 million rows. I don't want to open the whole file in one go. I was wondering if I can read a chunk of the file, process it and then read the next chunk? (I prefer to use pandas for it.)
import pandas as pd
# File size: ~50 MB
data = pd.read_excel("m.xlsx")
# Loading the whole file at once takes a long time
CodePudding user response:
Maybe you are looking for:
import pandas as pd
data = pd.read_excel("m.xlsx", nrows=50)  # nrows limits how many rows are parsed
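If you want a specific slice of rows while still keeping the header, skiprows also accepts an iterable, so you can skip data rows without skipping row 0 (the header). A minimal sketch, assuming a small demo file (the file and column names here are just for illustration):

```python
import pandas as pd

# Demo file: a header row plus 100 data rows
pd.DataFrame({"Num": range(1, 101)}).to_excel("demo.xlsx", index=False)

# Read data rows 51-100 only, keeping the header: skip sheet rows 1..50
# but not row 0 (the header row)
chunk = pd.read_excel("demo.xlsx", skiprows=range(1, 51), nrows=50)
print(chunk.iloc[0, 0], chunk.iloc[-1, 0])  # 51 100
```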
CodePudding user response:
Here is an option: read a fixed number of rows at a time and use skiprows as the offset for the next batch. For illustration, I used a file with 100 rows of data like this:
Num Factor
0 1 5
1 2 10
2 3 15
3 4 20
4 5 25
...
95 96 480
96 97 485
97 98 490
98 99 495
99 100 500
The following code reads 20 rows at a time from "Sheet1", skipping the header row, and prints the first and last value of the first column in each batch to show progress:
import pandas as pd

rows_to_read = 20
for offset in range(1, 101, rows_to_read):
    data = pd.read_excel(filename, sheet_name="Sheet1", header=None,
                         nrows=rows_to_read, skiprows=offset)
    print(data.iloc[0, 0], data.iloc[-1, 0])
Here is the output:
1 20
21 40
41 60
61 80
81 100
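The loop above hard-codes the total row count (101). A generalized sketch stops when a batch comes back empty, so the total number of rows need not be known in advance (the function name and demo file name here are my own, not from pandas):

```python
import pandas as pd

def read_excel_in_batches(filename, sheet_name, batch_size):
    """Yield successive DataFrames of at most batch_size data rows."""
    offset = 1  # start past the header row
    while True:
        batch = pd.read_excel(filename, sheet_name=sheet_name, header=None,
                              nrows=batch_size, skiprows=offset)
        if batch.empty:
            break  # no rows left
        yield batch
        offset += batch_size

# Demo: a sheet with a header plus 100 data rows, like the table above
pd.DataFrame({"Num": range(1, 101),
              "Factor": [5 * n for n in range(1, 101)]}
             ).to_excel("batch_demo.xlsx", index=False, sheet_name="Sheet1")

batches = list(read_excel_in_batches("batch_demo.xlsx", "Sheet1", 20))
print(len(batches))                                    # 5
print(batches[0].iloc[0, 0], batches[-1].iloc[-1, 0])  # 1 100
```

One caveat: as far as I know, each read_excel call still parses the workbook from the beginning, so this caps memory per batch but not total parsing time. For true row-by-row streaming of a large .xlsx, openpyxl's read-only mode may be worth a look.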