I have a dataframe in the following form:
index client_ip http_response_code
2022-07-23 05:10:10 00:00 172.19.0.1 300
2022-07-23 06:13:26 00:00 192.168.0.1 400
... ... ...
I need to group by client_ip and count the occurrences of 4xx codes in the http_response_code column, that is, how many values start with 4.
What I have tried is the following:
df.groupby('client_ip')['http_response_code'].apply(lambda x: (str(x).startswith(str(4))).sum())
But I get the following error:
AttributeError: 'bool' object has no attribute 'sum'
However, if I only need to count the occurrences of 400, the following gives no error, even though the comparison is also boolean:
df.groupby('client_ip')['http_response_code'].apply(lambda x: (x==400).sum())
Any idea of what is wrong here?
CodePudding user response:
Any idea of what is wrong here?
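Your function receives a Series as input. Comparing it against a value gives a Series of boolean values, which can be summed. Calling str(x) instead turns the whole Series into a single string, and startswith on that string returns a plain bool, which has no .sum. Use .astype(str) to convert each value to str rather than the whole Series, then test each value through the .str accessor. Example: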
import pandas as pd
df = pd.DataFrame({"User":["A","A","B"],"Status":[400,301,302]})
grouped = df.groupby("User")["Status"].apply(lambda x:x.astype(str).str.startswith("4").sum())
print(grouped)
output
User
A 1
B 0
Name: Status, dtype: int64
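To make the difference between the three expressions concrete, here is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the question's data). It only inspects what each expression returns:

```python
import pandas as pd

# Illustrative values, not from the original dataframe
s = pd.Series([400, 301, 404])

# str(s) renders the whole Series (index, values, dtype) as one string,
# so startswith returns a single Python bool, which has no .sum
whole = str(s).startswith("4")
print(type(whole))

# s == 400 compares element by element and yields a boolean Series
print((s == 400).sum())

# astype(str) converts each element; .str.startswith tests each element
per_element = s.astype(str).str.startswith("4")
print(per_element.sum())
```

The first expression is what happens inside the lambda in the question, since each group is passed to the lambda as a Series.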
CodePudding user response:
IIUC, this should work for you:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({
    'client_id': np.random.choice([1, 2, 3], size=10),
    'http_response_code': np.random.choice([300, 400], size=10),
})
# Keep only rows whose code starts with 4, then count the rows per client
is_4xx = df.http_response_code.apply(lambda x: str(x).startswith('4'))
print(df[is_4xx].groupby('client_id').count())
Dataframe:
client_id http_response_code
0 3 300
1 2 400
2 3 300
3 3 400
4 1 300
5 3 400
6 3 400
7 2 300
8 3 300
9 2 300
Result:
http_response_code
client_id
2 1
3 3
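Note that filtering before grouping drops clients that have no 4xx responses at all (client 1 in the result above). If those should appear with a count of 0, one possible alternative is to build a boolean mask and sum it per group. This is only a sketch: it assumes the codes are stored as three-digit integers, and the sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; column names follow the question
df = pd.DataFrame({
    "client_ip": ["172.19.0.1", "172.19.0.1", "192.168.0.1", "10.0.0.1"],
    "http_response_code": [404, 300, 400, 200],
})

# For three-digit integer codes, 4xx means integer division by 100 gives 4
is_4xx = df["http_response_code"] // 100 == 4

# Summing the mask per group counts the True values;
# clients without any 4xx response stay in the result with 0
counts = is_4xx.groupby(df["client_ip"]).sum()
print(counts)
```

If the column holds strings rather than integers, the .astype(str).str.startswith("4") mask from the other answer can be used in place of the integer division.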