Python - Group by with multiple conditions on columns-CodePudding

I have the following example dataframe:

data = {'ref':['1', '2', '3', '4', '5'],
        'checked':[True, True, True, False, True],
        'rag':['r', 'r', 'g', 'a', 'r'],
        'group':['high', 'low', 'high', 'medium', 'high']}

dataframe = pd.DataFrame(data)

I want to group on group and do some conditional counts where certain conditions are met so I get the following:

I can group by group and do n by the following:

df = dataframe.groupby(['group']).agg(
         n=('ref', 'count')
        ).reset_index()

But I am struggling to also count the number of times for each group that:

checked = True
rag = g
rag = a
rag = r

Any help would be much appreciated!

edit: changed True/False strings to Boolean

CodePudding user response：

You have a few challenges.

For instance, your True/False are strings, so you should either initialize them as booleans and use sum or convert to boolean during aggregation.

To count the rag, it's easier to use pandas.crosstab and join it while you still have the groups as index.

df = (dataframe
      .groupby(['group'])
      .agg(**{'n': ('ref', 'count'),
              'checked=True': ('checked', lambda s: s.eq('True').sum()),
           })
      .join(pd.crosstab(dataframe['group'], dataframe['rag'])
              .add_prefix('rag=')
           )
      .reset_index()
     )

output:

    group  n  checked=True  rag=a  rag=g  rag=r
0    high  3             3      0      1      2
1     low  1             1      0      0      1
2  medium  1             0      1      0      0

CodePudding user response：

You can try pivot_table separately on your checked and rag columns

n = df.groupby(['group']).agg(n=('ref', 'count'))

dfs = []
for column in ['checked', 'rag']:
    df_ = (df.pivot_table(index='group', columns=[column], values='ref',
                          aggfunc='count', fill_value=0)
          .rename(columns=lambda col: f'{column}={col}')
          .rename_axis(None, axis=1))
    dfs.append(df_)
df = pd.concat(dfs, axis=1).drop('checked=False', axis=1)

print(n.join(df))

        n  checked=True  rag=a  rag=g  rag=r
group
high    3             3      0      1      2
low     1             1      0      0      1
medium  1             0      1      0      0

CodePudding user response：

There is an issue with your data example:

data = {'ref':['1', '2', '3', '4', '5'],
        'checked':[True, True, True, False, True],
        'rag':['r', 'r', 'g', 'a', 'r'],
        'group':['high', 'low', 'high', 'medium', 'high']}

df = pd.DataFrame(data)

for checked column, you should enter value as True/False without in quotation. Otherwise, python will interpret True'/'Fasle' as string.

The idea here is two steps: (1) you use iteration with groupby. (2) then you merge/concat them into 1 table:

# Create empty table
table = pd.DataFrame()

# List of column you want to iterate:
col_iter = ['checked', 'rag']

# Iterate:
for col in col_iter:
    # Obtain unique values in each col used
    uni = df[col].unique()
    
    # Iterate for each unique value in col.
    # Set tem var
    # Concat tem to table
    for val in uni:
        tem = df.groupby('group').apply(lambda g: (g[col]==val).sum())
        table = pd.concat([table, tem], axis=1).rename(columns={0:f'{col}={val}'})