I have the following part of code:
for batch in chunk(df, n):
unique_request = batch.groupby('clientip')['clientip'].count()
unique_ua = batch.groupby('clientip')['name'].nunique()
reply_length_avg = batch.groupby('clientip')['bytes'].mean()
response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch
. Since the initial DataFrame df
can be quite large, I need to find an efficient way of doing the following:
- Putting together the results of the for loop in a new DataFrame with columns
unique_request
,unique_ua
,reply_length_avg
andresponse4xx
at each iteration. - Stacking these DataFrames below of each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
unique_request = batch.groupby('clientip')['clientip'].count()
unique_ua = batch.groupby('clientip')['name'].nunique()
reply_length_avg = batch.groupby('clientip')['bytes'].mean()
response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
concat = [unique_request, unique_ua, reply_length_avg, response4xx]
df_final = pd.concat([df_final, concat], axis = 1, ignore_index = True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea of what should I try?
CodePudding user response:
First of all avoid using pd.concat to build the main dataframe inside a for loop as it gets exponentially slower. The problem you are facing is that pd.concat
should receive as input a list of dataframes, however you are passing [df_final, concat] which, in essence, is a list containing 2 elements: one dataframe and one list of dataframes. Ultimately, it seems you want to stack the dataframes vertically, thus axis
should be 0
and not 1
.
Therefore, I suggest you to do the following:
df_final = []
for batch in chunk(df, n):
unique_request = batch.groupby('clientip')['clientip'].count()
unique_ua = batch.groupby('clientip')['name'].nunique()
reply_length_avg = batch.groupby('clientip')['bytes'].mean()
response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis = 1, ignore_index = True)
df_final.append(concat)
df_final = pd.concat(df_final, axis = 0, ignore_index = True)
return df_final
Note that pd.concat
receives a list of dataframes and not a list that contains a list inside of it! Also, this approach is way faster since the pd.concat
inside the for loop doesn't get bigger every iteration :)
I hope it helps!