I have about 50k txt files that I am trying to read into a pandas DataFrame with the code below, but the process has already been running for 2 hours. Is there a better way to speed this up?
import glob

import pandas as pd

folder_path = '/drive/My Drive/dataset/train'
file_list = glob.glob(folder_path + "/*.txt")

def read_clean_df(file_name) -> pd.DataFrame:
    df = pd.read_fwf(file_name, header=None)
    # drop row 19, transpose, and promote the first row to the header
    df = df.drop(df.index[19])
    df = df.T
    df.columns = df.iloc[0]
    df = df[1:]
    df.reset_index(drop=True, inplace=True)
    return df

train_df = read_clean_df(file_list[0])
for file_name in file_list[1:]:
    df = read_clean_df(file_name)
    train_df = pd.concat([train_df, df], axis=0)
train_df.reset_index(drop=True, inplace=True)
print(train_df.head(30))
CodePudding user response:
Yeah, repeatedly calling concat is slow; that is the reason DataFrame.append was deprecated. Each concat copies every row accumulated so far, so the total work grows quadratically with the number of files. Instead, collect the DataFrames in a list and concatenate once:
dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)
train_df = pd.concat(dfs)
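If reading and parsing the 50k files (rather than the concatenation) is still the bottleneck, the per-file work can also be spread across processes. A minimal sketch, assuming read_clean_df and file_list from the question are defined at module level so they can be pickled:

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

with ProcessPoolExecutor() as executor:
    # chunksize batches the ~50k small tasks so inter-process overhead stays low
    dfs = list(executor.map(read_clean_df, file_list, chunksize=100))

train_df = pd.concat(dfs, ignore_index=True)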
CodePudding user response:
Do the concatenation once, at the end:
dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)
train_df = pd.concat(dfs, axis=0)
If that is not fast enough, use datatable, which can read csv files with multiple threads:
from datatable import dt, iread

df = dt.rbind(iread(file_list)).to_pandas()
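Each frame yielded by iread can also be converted back to pandas individually and run through the same per-file cleanup as read_clean_df. A sketch, assuming the files parse the same way under datatable's reader as under read_fwf (worth verifying on a couple of files first):

from datatable import iread

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # same per-file cleanup as read_clean_df, minus the file reading
    df = df.drop(df.index[19]).T
    df.columns = df.iloc[0]
    return df[1:].reset_index(drop=True)

frames = (clean(frame.to_pandas()) for frame in iread(file_list))
train_df = pd.concat(frames, ignore_index=True)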