Pandas dataframe concat after reading large number of txt files using glob takes never ending time

There are some 50k txt files that I am trying to read into a pandas DataFrame with the code below, but the process has already been running for 2 hours. Is there a better way to speed this up?

import glob

import pandas as pd

folder_path = '/drive/My Drive/dataset/train'
file_list = glob.glob(folder_path + "/*.txt")


def read_clean_df(file_name) -> pd.DataFrame:
    df = pd.read_fwf(file_name, header=None)  # parse the fixed-width file
    df = df.drop(df.index[19])                # drop the unwanted 20th row
    df = df.T                                 # transpose: rows become columns
    df.columns = df.iloc[0]                   # first row holds the column names
    df = df[1:]                               # drop that header row from the data
    df.reset_index(drop=True, inplace=True)
    return df


train_df = read_clean_df(file_list[0])

for file_name in file_list[1:]:
    df = read_clean_df(file_name)
    train_df = pd.concat([train_df, df], axis=0)

train_df.reset_index(drop=True, inplace=True)
print(train_df.head(30))

CodePudding user response:

Yes, repeatedly calling concat in a loop is slow; this kind of quadratic copying is the reason DataFrame.append was deprecated.

Instead, collect the frames in a list and concatenate them once:

dfs = []

for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)

train_df = pd.concat(dfs)
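
For intuition on why this matters: each concat inside the loop copies the whole accumulated frame again, so the loop does quadratic work overall, while a single concat copies each row only once. A rough timing sketch (make_df is a hypothetical stand-in for reading one file; absolute numbers depend on the machine):

import time

import numpy as np
import pandas as pd

def make_df(i: int) -> pd.DataFrame:
    # hypothetical stand-in for read_clean_df: one small 1x20 frame
    return pd.DataFrame(np.random.rand(1, 20))

n = 2000

# quadratic: every iteration copies everything accumulated so far
start = time.perf_counter()
acc = make_df(0)
for i in range(1, n):
    acc = pd.concat([acc, make_df(i)], axis=0)
print("concat in a loop:", time.perf_counter() - start)

# linear: build the list, copy once at the end
start = time.perf_counter()
frames = [make_df(i) for i in range(n)]
once = pd.concat(frames, axis=0)
print("one final concat:", time.perf_counter() - start)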

CodePudding user response:

Do the concatenation once, at the end:

dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)
train_df = pd.concat(dfs, axis=0)

If that is not fast enough, use the datatable package, which can read csv files with multithreaded IO:

import datatable as dt

# read all the files, stack the frames, then convert to pandas
df = dt.rbind(list(dt.iread(file_list))).to_pandas()
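
If the fixed-width parsing itself is the bottleneck rather than the concatenation, the per-file work can also be spread across CPU cores. A minimal sketch using the standard library's concurrent.futures (assuming read_clean_df and file_list as defined in the question; the gain depends on core count and file sizes):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# Parse the files in parallel worker processes, then concatenate once.
# Note: read_clean_df must be defined at module top level so the
# executor can pickle it on platforms that spawn workers.
if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        dfs = list(pool.map(read_clean_df, file_list, chunksize=64))
    train_df = pd.concat(dfs, ignore_index=True)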