Sort a list of CSVs in a folder containing 10k files faster

Time: 09-27

Hi, I'm a newbie in Python and in coding in general. This is my very first post.

I am trying to open and concatenate the last 20 files into a dataframe.

I am successful in doing so when working with a test folder that contains only 100 files, but as soon as I try my code on the real folder containing 10k files, it is very slow and takes about 5 minutes to finish.

Here is my attempt:

import pandas as pd
import glob
from datetime import datetime
import numpy as np
import os

path = r'K:/industriel/abc/03_LOG/PRODUCTION/CSV/'

path2 = r'K:/industriel/abc/03_LOG/PRODUCTION/IMG/'

os.chdir(path)
files = glob.glob(path + "/*.csv")
#files = filter(os.path.isfile, os.listdir(path))
files = [os.path.join(path, f) for f in files]
files.sort(key=lambda x: os.path.getctime(x), reverse=False)
dfs = pd.DataFrame()
for i in range(20):
    dfs = dfs.append(pd.read_csv(files[i].split('\\')[-1],delimiter=';', usecols=[0,1,3,4,9,10,20]))

dfs = dfs.reset_index(drop=True)

print(dfs.head(10))

CodePudding user response:

Try reading all the individual files to a list and then concat to form your dataframe at the end:

files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(".csv")]
files.sort(key=lambda x: os.path.getctime(x), reverse=False)
dfs = list()
for i, file in enumerate(files[:20]):
    dfs.append(pd.read_csv(file, delimiter=';', usecols=[0,1,3,4,9,10,20]))
dfs = pd.concat(dfs)
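If the bottleneck turns out to be the sort key rather than the reads, a further option is os.scandir: each DirEntry it yields carries cached stat information (on Windows it comes straight from the directory listing), so sorting 10k files by ctime can avoid one extra system call per path compared with os.path.getctime. A minimal, self-contained sketch, where the temporary-directory setup merely stands in for the real K:/... folder:

```python
import os
import tempfile

import pandas as pd

# Demo setup: a throwaway folder with a few small CSVs standing in
# for the real production directory.
with tempfile.TemporaryDirectory() as path:
    for i in range(5):
        with open(os.path.join(path, f'log{i}.csv'), 'w') as f:
            f.write(f'a;b\n{i};{i * 10}\n')

    # os.scandir yields DirEntry objects whose stat() results are cached,
    # so the ctime sort does not need a separate stat call per file.
    entries = [e for e in os.scandir(path) if e.name.endswith('.csv')]
    entries.sort(key=lambda e: e.stat().st_ctime)
    first20 = [e.path for e in entries[:20]]  # same slice as the code above

    # Read once per file, then concatenate a single time at the end.
    dfs = pd.concat(
        (pd.read_csv(p, delimiter=';') for p in first20),
        ignore_index=True,
    )

print(len(dfs))  # 5 rows, one per demo file
```

The usecols=[0,1,3,4,9,10,20] selection from the question is omitted here only because the demo files have two columns; it would be passed to pd.read_csv unchanged against the real data.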

CodePudding user response:

You can use pd.concat() with a list of the read files. You can replace your code after files.sort(...) with the following:

dfs = pd.concat([
    pd.read_csv(file, delimiter=';', usecols=[0,1,3,4,9,10,20])
    for file in files[:20]
])
dfs = dfs.reset_index(drop=True)
print(dfs.head(10))