Count occurrences of integers starting with 4 in dataframe

Time:07-25

I have a dataframe in the following form:

        index              client_ip  http_response_code
                                                                                    
2022-07-23 05:10:10 00:00  172.19.0.1     300   
2022-07-23 06:13:26 00:00  192.168.0.1    400
          ...                 ...         ...   

I need to group by client_ip and count, for each client, how many values in the http_response_code column are 4xx codes, that is, integers that start with 4.

What I have tried is the following:

df.groupby('client_ip')['http_response_code'].apply(lambda x: (str(x).startswith(str(4))).sum())

But I get the following error:

AttributeError: 'bool' object has no attribute 'sum'

However, if I instead count the occurrences of exactly 400, the following does not raise any error, even though the comparison is also boolean:

df.groupby('client_ip')['http_response_code'].apply(lambda x: (x==400).sum())

Any idea of what is wrong here?

CodePudding user response:

Any idea of what is wrong here?

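Your function receives a Series as input. Comparing it against a value gives a Series of boolean values, which can be summed. Calling str(x) instead converts the whole Series into a single string, so startswith returns a single bool, which has no .sum method. Use .astype(str) to convert each value to a string rather than the whole Series, then test the prefix through the .str accessor. Example: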

import pandas as pd
df = pd.DataFrame({"User":["A","A","B"],"Status":[400,301,302]})
grouped = df.groupby("User")["Status"].apply(lambda x:x.astype(str).str.startswith("4").sum())
print(grouped)

Output:

User
A    1
B    0
Name: Status, dtype: int64
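If the column really holds integer HTTP status codes, a numeric range check is an alternative to the string conversion. The following is only a sketch under that assumption: the toy values are made up for illustration, and unlike the string prefix test it would not count values such as 4 or 4000.

```python
import pandas as pd

# Toy data: column names follow the question, values are invented for illustration.
df = pd.DataFrame({
    "client_ip": ["172.19.0.1", "172.19.0.1", "192.168.0.1", "192.168.0.1"],
    "http_response_code": [404, 300, 400, 403],
})

# between() is inclusive on both ends by default, so this flags 400..499.
is_4xx = df["http_response_code"].between(400, 499)

# Summing a boolean Series per group counts the True values.
counts_4xx = is_4xx.groupby(df["client_ip"]).sum()
print(counts_4xx)
```

Grouping the boolean mask by the client_ip column avoids apply with a lambda, and the result is still one count per client.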

CodePudding user response:

IIUC, this should work for you:

import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({
    'client_id': np.random.choice([1, 2, 3], size=10, replace=True, p=None),
    'http_response_code': np.random.choice([300, 400], size=10, replace=True, p=None),
})
# Keep only the rows whose code starts with 4, then count rows per client
starts_with_4 = df.http_response_code.apply(lambda x: str(x).startswith('4'))
print(df[starts_with_4].groupby('client_id').count())

Dataframe:

   client_id  http_response_code
0          3                 300
1          2                 400
2          3                 300
3          3                 400
4          1                 300
5          3                 400
6          3                 400
7          2                 300
8          3                 300
9          2                 300

Result:

           http_response_code
client_id                    
2                           1
3                           3
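Note that filtering before grouping drops every client that has no matching row, which is why client 1 is missing from the result above. If a zero count is wanted for those clients, one option is to group the boolean mask itself. A minimal sketch, using small hand-written data (not the random frame above) so the behaviour is easy to follow:

```python
import pandas as pd

# Hand-written illustrative data; client 1 has no 4xx response.
df = pd.DataFrame({
    "client_id": [1, 2, 2, 3, 3],
    "http_response_code": [300, 400, 300, 400, 404],
})

starts_with_4 = df["http_response_code"].astype(str).str.startswith("4")

# Filter first: clients without any match disappear from the result.
filtered = df[starts_with_4].groupby("client_id")["http_response_code"].count()

# Group the mask instead: every client is kept, with 0 where nothing matched.
all_clients = starts_with_4.groupby(df["client_id"]).sum()

print(filtered)
print(all_clients)
```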