Home > Net >  resuse agg columns panda group by
resuse agg columns panda group by

Time:07-13

I have below contents in csv file:

key1    key2    Key3    key4    key5

Val1    A        51     'True'  25
Val1    A        50     'False' 25
Val1    A        49     'True'  25
Val1    A        48     'True'  25
Val2    A        47     'False' 25
Val2    A        46     'True'  25
Val2    A        45     'False' 25
Val2    A        44     'True'  25
Val2    A        43     'True'  25

Output should be like:

key1 key2 max_key5 total_key4 total_true_key4 grade
Val1  A   51       4          3                1
Val2  A   47       5          3                2

I have to group by key1 and key2 then find the max of key5 and count of total rows of key4 and count of total true rows of key4 and then percentage.

What I am trying:

  grd = "1 if avg > 80 else 2 if avg > 50 else c"

    pct = lambda x: (1 if x > 80 else (2 if x > 50 else 3))



json_data
    .assign(_key4=lambda df_: df_['key4'] == "'True'")
    .groupby(['key1', 'key2'])
    .agg(
        maxkey5=('key5', 'max'), 
        total_key4=('key4', 'count'), 
        total_true_key4=('_key4', 'sum')
    )
   .eval('avg = (total_true_key4 * 100) / total_key4')
   .eval('feg = grd')
  #.apply(pct(avg))

eval to calculate percentage is working fine.. not able to do if else if on avg column

I don't want to apply separately like in another separate statement.

CodePudding user response:

how about this?

json_data
    .assign(_key4=lambda df_: df_['key4'] == "'True'")
    .groupby(['key1', 'key2'])
    .agg(
        maxkey5=('key5', 'max'), 
        total_key4=('key4', 'count'), 
        total_true_key4=('_key4', 'sum')
        percentage=('key4', lambda x: int(sum(x)/len(x)*100))
    )

CodePudding user response:

As mentioned in the comments, I am not quite sure how you would get your desired output with your logic. Also, you define maxkey5=('key5', 'max') but actually you are getting the max of key3. So, if I am not mistaken, you are looking for something like this:

import pandas as pd
from numpy import mean

df = pd.DataFrame({
    "key1": ["Val1"]*4 ["Val2"]*5,
    "key2": ["A"]*9,
    "key3": [51, 50, 49, 48, 47, 46, 45, 44, 43],
    "key4": ["'True'", "'False'", "'True'", "'True'", "'False'", "'True'", "'False'", "'True'", "'True'"],
    "key5": [25] * 9
})


(
    df
    .assign(total_true_key4=df['key4']=="'True'", grade=df['key4']=="'True'")
    .groupby(["key1", "key2"])[["key3", "key4", "total_true_key4", "grade"]]
    .agg(
        {"key3": "max", 
         "key4": "count",
         "total_true_key4": "sum",
         "grade": lambda x: 1 if mean(x)>.8 else (2 if mean(x)>.5 else 3)}
    )
    .rename(columns={"key3":"max_key3", "key4":"total_key4"}).reset_index()
)

which results into the following data frame:

   key1   key2  max_key3  total_key4  total_true_key4  grade
0  Val1   A     51        4           3                2
1  Val2   A     47        5           3                2
  • Related