Storing and accessing dataframe objects


I'm trying to read some Excel files as pandas DataFrames. The problem is that they are quite large (about 2500 rows, with columns up to the 'CYK' label in the Excel sheet, and there are 14 of them).

Every time I run my program, it has to import the files from Excel again. This makes the runtime grow a lot: it's currently a bit more than 15 minutes, and so far the program doesn't even do anything significant besides importing the files.

I would like to be able to import the files just once, then save the dataframe objects somewhere and make my program work only on those dataframes.

Any suggestions?

This is the code I have developed so far:

import pandas as pd
import os

path = r'C:/Users/damia/Dropbox/Tesi/WIOD'
dirs = os.listdir(path)

complete_dirs = []
for f in dirs:
    complete_dirs.append(path + r"/" + f)

data = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    data.append(wiod)

If anyone is interested, you can find the files I'm trying to read at this link: http://www.wiod.org/database/wiots16

CodePudding user response:

You could use the to_pickle and read_pickle methods provided by pandas to serialize the DataFrames and store them in files (see the pandas docs).

Example pickling:

data = []
pickle_paths = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    # here's where you store it
    pickle_loc = 'your_unique_path_to_save_this_frame'
    wiod.to_pickle(pickle_loc)
    pickle_paths.append(pickle_loc)
    data.append(wiod)

Unpickling:

data = []
for el in pickle_paths:
    data.append(pd.read_pickle(el))
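To get the "import once, then reuse" behavior the question asks for, the pickling and unpickling steps can be combined into a small cache helper: read the pickle if it already exists, and only fall back to the slow Excel import otherwise. This is a minimal sketch; `load_cached` and the `build` callback are hypothetical names, not pandas API.

```python
import os
import pandas as pd

def load_cached(pickle_path, build):
    """Return the cached DataFrame if the pickle exists; otherwise
    call build() (the slow step, e.g. read_excel), cache the result
    to disk, and return it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    df = build()
    df.to_pickle(pickle_path)
    return df

# Usage with the question's loop (hypothetical pickle names):
# data = [load_cached(f"wiod_{i}.pkl",
#                     lambda el=el: pd.read_excel(el, engine='pyxlsb'))
#         for i, el in enumerate(complete_dirs)]
```

On the first run this pays the full Excel-import cost and writes the pickles; every later run only deserializes, which should cut the 15-minute startup to seconds.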

CodePudding user response:

Another solution using to_pickle and read_pickle.

As an aside, you can read Excel files directly from URLs if you don't want to save to your drive first.

#read each file from the URL and save to disk
for year in range(2000, 2015):
    url = f"http://www.wiod.org/protected3/data16/wiot_ROW/WIOT{year}_Nov16_ROW.xlsb"
    pd.read_excel(url, engine='pyxlsb').to_pickle(f"{year}.pkl")

#read pickle files from disk to a dataframe
data = list()    
for year in range(2000, 2015):
    data.append(pd.read_pickle(f"{year}.pkl"))
data = pd.concat(data)
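One possible refinement: after `pd.concat`, rows from different years are no longer distinguishable. Passing a dict keyed by year instead of a plain list makes concat add the year as an outer index level. A small sketch, using hypothetical stand-in frames in place of the per-year pickles:

```python
import pandas as pd

# Stand-ins for the per-year pickles; in the real script these would
# come from pd.read_pickle(f"{year}.pkl").
frames = {year: pd.DataFrame({"value": [year, year]})
          for year in range(2000, 2003)}

# Passing a mapping to concat uses its keys as an outer index level,
# so every row stays attributable to its source year.
data = pd.concat(frames, names=["year"])
```

With this layout, `data.loc[2001]` selects just that year's rows.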