I read a bunch of pickle files with the code below. I want to loop through them and get the length of each file, i.e. how many records it contains.
Two issues:
- Concat combines all my dfs into one, which takes a long time. Is there any way to just read the length?
- If concat is the way to go, how can I get the length of each file once they all go into one dataframe? I guess the problem is identifying where each file starts and stops. I suspect I could add a column holding each filename and count on that.
What I've tried:
import pandas as pd
import glob, os
files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')  # raw string so the backslashes are not treated as escapes
df = pd.concat([pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp)) for fp in files])
Any help would be appreciated.
CodePudding user response:
Append each dataframe to a list first, then call pd.concat once at the end. Appending or concatenating inside a for loop copies the accumulated data on every iteration, which gets quadratically slower.
import pandas as pd
import glob, os
files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')  # raw string so the backslashes are not treated as escapes
dfs = []
for fp in files:
    df = pd.read_pickle(fp, compression='xz').assign(New=os.path.basename(fp))
    dfs.append(df)
    # or, as @G.Anderson points out, maybe store only the length instead:
    # dfs.append(len(df))
df = pd.concat(dfs)
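If you do keep the concat, the filename column is enough to recover the per-file counts afterwards. Here is a minimal sketch of that idea; the two small frames and the names `frames`, `combined` and `rows_per_file` are made up for illustration and stand in for whatever read_pickle returns for your files.

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from each pickle file.
frames = {
    "a.pkl": pd.DataFrame({"x": [1, 2, 3]}),
    "b.pkl": pd.DataFrame({"x": [4, 5]}),
}

# Tag each frame with its source file name, then concatenate once.
combined = pd.concat(
    [frame.assign(New=name) for name, frame in frames.items()],
    ignore_index=True,
)

# Rows per source file, recovered from the tag column.
rows_per_file = combined.groupby("New").size()
print(rows_per_file)
```

This only makes sense if you need the combined dataframe anyway; the grouping happens after all the data has been loaded and copied.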
CodePudding user response:
If you only want the lengths of the individual dataframes, then the call to concat is entirely unnecessary overhead. Your own code already builds a dataframe from each file, so you can repurpose it to capture only the lengths.
import pandas as pd
import glob, os
files = glob.glob(r'O:\Stack\Over\Flow\*.pkl')  # raw string so the backslashes are not treated as escapes
# the call to assign is also unnecessary, because adding a column doesn't change the length
lens = [len(pd.read_pickle(fp, compression='xz')) for fp in files]
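A plain list of lengths loses track of which number belongs to which file, so it may be handier to key the counts by filename. The sketch below is one way to do that; `pickle_lengths` is a hypothetical helper name, and the demo writes two throwaway files to a temp folder so it runs on its own. Note that each file is still fully unpickled, only the concat is skipped.

```python
import glob
import os
import tempfile

import pandas as pd


def pickle_lengths(pattern, compression="xz"):
    """Return {file name: row count} for every pickle matching pattern."""
    return {
        os.path.basename(fp): len(pd.read_pickle(fp, compression=compression))
        for fp in sorted(glob.glob(pattern))
    }


# Demo on throwaway files; swap in the pattern for your own folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"x": range(3)}).to_pickle(
        os.path.join(tmp, "a.pkl"), compression="xz"
    )
    pd.DataFrame({"x": range(5)}).to_pickle(
        os.path.join(tmp, "b.pkl"), compression="xz"
    )
    lengths = pickle_lengths(os.path.join(tmp, "*.pkl"))

print(lengths)
```

Since only one dataframe is alive at a time, memory use stays at roughly the size of the largest file rather than the sum of all of them.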