How to speed up importing multiple csv files, doing some cleaning of the data and then blending them together


I have multiple csv files from which I have to remove 2 rows because they contain only NaNs. I want to load the first file, clean it, then load the second file, clean it, concatenate it with the first, and so on.

This is the code:

import pandas as pd
from tqdm import tqdm

df_result = None
for file in tqdm(files):
    df = pd.read_csv(file)
    df = clean_csv(df)
    if df_result is None:
        # first file: nothing to concatenate with yet
        df_result = df
    else:
        # concatenating inside the loop re-copies df_result on every iteration
        df_result = pd.concat([df_result, df], axis='index', ignore_index=True)

where clean_csv is:

def clean_csv(df):
    # drop the first two rows, which contain only NaNs
    df_1 = df.drop(labels=[0, 1])
    # drop the 'Start Time' column
    df_1 = df_1.drop('Start Time', axis=1)
    return df_1

CodePudding user response:

Another way is to append the DataFrames to a list and concatenate once after the for loop, like this. You are currently concatenating on every iteration, which is likely what slows your script down.

import pandas as pd
from tqdm import tqdm

df_result = []
for file in tqdm(files):
    df = pd.read_csv(file, index_col=None, header=0)
    df = clean_csv(df)
    df_result.append(df)

# a single concatenation at the end instead of one per file
df_final = pd.concat(df_result, axis=0, ignore_index=True)

CodePudding user response:

Concatenation gets slower as the accumulated result grows. When a script appends small pieces of data one at a time like this, every pass has to re-copy everything accumulated so far: the first pass handles 1 line of data, the next 2, then 3, and so on, with each successive pass handling more total data.
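
As a rough illustration of that quadratic growth (with made-up data, not the poster's files), counting the characters touched by a naive string-building loop over 10,000 lines shows how the work balloons:

# Hypothetical illustration of the cost of repeated concatenation:
# every pass re-copies everything accumulated so far.
lines = [f"row {i}" for i in range(10_000)]

result = ""
copied = 0
for line in lines:
    result = result + line + "\n"  # copies the whole accumulated string again
    copied += len(result)          # work done on this pass keeps growing

print(f"characters touched by the naive loop: {copied:,}")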

A solution I've used in the past is to build smaller chunks of data and only concatenate the chunks at the end, which minimizes the number of passes over long data. A good chunk size is roughly the square root of the total size of the data set. So if, for example, you have 10,000 lines of data to process, you can concatenate packets of 100 lines each and then concatenate those 100-line packets into the final result.

Without breaking up the data you end up processing roughly 50,000,000 lines in total because of the repeated passes over the same data; by breaking the data into packets like this, you only process roughly 1,000,000 lines.
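
Here is a minimal sketch of how that chunking idea could look for the DataFrame case in the question; concat_in_chunks is a hypothetical helper, and it reuses the poster's clean_csv:

import math
import pandas as pd

def concat_in_chunks(files):
    # Hypothetical helper: read and clean the files in sqrt-sized batches,
    # concatenate each batch, then concatenate the batches once at the end.
    chunk_size = max(1, math.isqrt(len(files)))
    chunks = []
    for i in range(0, len(files), chunk_size):
        batch = [clean_csv(pd.read_csv(f)) for f in files[i:i + chunk_size]]
        chunks.append(pd.concat(batch, ignore_index=True))
    return pd.concat(chunks, ignore_index=True)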
