Pandas dataframe concat after reading large number of txt files using glob takes never ending time

There are some 50k txt files that I am trying to read into a pandas DataFrame with the code below, but the process has already been running for 2 hours. Is there a better way to speed this up?

import glob

import pandas as pd

folder_path = '/drive/My Drive/dataset/train'
file_list = glob.glob(folder_path + "/*.txt")


def read_clean_df(file_name) -> pd.DataFrame:
    df = pd.read_fwf(file_name, header=None)  # parse the fixed-width file
    df = df.drop(df.index[19])                # drop the unwanted 20th row
    df = df.T                                 # transpose: rows become columns
    df.columns = df.iloc[0]                   # first row holds the column names
    df = df[1:]                               # drop that header row from the data
    df.reset_index(drop=True, inplace=True)
    return df


train_df = read_clean_df(file_list[0])

for file_name in file_list[1:]:
    df = read_clean_df(file_name)
    train_df = pd.concat([train_df, df], axis=0)

train_df.reset_index(drop=True, inplace=True)
print(train_df.head(30))

CodePudding user response:

Yes, repeatedly calling concat in a loop is slow; this kind of quadratic copying is the reason DataFrame.append was deprecated.

Instead, collect the frames in a list and concatenate them once:

dfs = []

for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)

train_df = pd.concat(dfs)
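
For intuition on why this matters: each concat inside the loop copies the whole accumulated frame again, so the loop does quadratic work overall, while a single concat copies each row only once. A rough timing sketch (make_df is a hypothetical stand-in for reading one file; absolute numbers depend on the machine):

import time

import numpy as np
import pandas as pd

def make_df(i: int) -> pd.DataFrame:
    # hypothetical stand-in for read_clean_df: one small 1x20 frame
    return pd.DataFrame(np.random.rand(1, 20))

n = 2000

# quadratic: every iteration copies everything accumulated so far
start = time.perf_counter()
acc = make_df(0)
for i in range(1, n):
    acc = pd.concat([acc, make_df(i)], axis=0)
print("concat in a loop:", time.perf_counter() - start)

# linear: build the list, copy once at the end
start = time.perf_counter()
frames = [make_df(i) for i in range(n)]
once = pd.concat(frames, axis=0)
print("one final concat:", time.perf_counter() - start)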

CodePudding user response:

Do the concatenation once, at the end:

dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)
train_df = pd.concat(dfs, axis=0)

If that is not fast enough, use the datatable package, which can read csv files with multithreaded IO:

import datatable as dt

# read all the files, stack the frames, then convert to pandas
df = dt.rbind(list(dt.iread(file_list))).to_pandas()
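
If the fixed-width parsing itself is the bottleneck rather than the concatenation, the per-file work can also be spread across CPU cores. A minimal sketch using the standard library's concurrent.futures (assuming read_clean_df and file_list as defined in the question; the gain depends on core count and file sizes):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# Parse the files in parallel worker processes, then concatenate once.
# Note: read_clean_df must be defined at module top level so the
# executor can pickle it on platforms that spawn workers.
if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        dfs = list(pool.map(read_clean_df, file_list, chunksize=64))
    train_df = pd.concat(dfs, ignore_index=True)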