Efficient way to append dataframes below each other-CodePudding

I have the following part of code:

for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())

where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:

Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames below of each other at each iteration.

I tried to do the following:

df_final = pd.DataFrame()
    
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis = 1, ignore_index = True)
    
return df_final

But I am getting the following error:

TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

Any idea of what should I try?

CodePudding user response：

First of all avoid using pd.concat to build the main dataframe inside a for loop as it gets exponentially slower. The problem you are facing is that pd.concat should receive as input a list of dataframes, however you are passing [df_final, concat] which, in essence, is a list containing 2 elements: one dataframe and one list of dataframes. Ultimately, it seems you want to stack the dataframes vertically, thus axis should be 0 and not 1.

Therefore, I suggest you to do the following:

df_final = []

for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis = 1, ignore_index = True)
    df_final.append(concat)

df_final = pd.concat(df_final, axis = 0, ignore_index = True)
    
return df_final

Note that pd.concat receives a list of dataframes and not a list that contains a list inside of it! Also, this approach is way faster since the pd.concat inside the for loop doesn't get bigger every iteration :)

I hope it helps!