Home > Software engineering >  How to obtain dataframe from grouped element after using apply
How to obtain dataframe from grouped element after using apply

Time:04-04

Let's say this the dataframe:

First Name  Last Name   Val1 
George      Clooney      N     
George      Clooney      N   
George      Clooney      Y
George      Freeman      N
George      Freeman      N
Claire      Stark        N
Claire      Stark        Y   

Then the goal is to produce this:

First Name  Last Name   Val1  Total
George      Clooney      Y      3
George      Freeman      N      2
Claire      Stark        Y      2

The total Val1 is Y as long as one of the instances is Y.

My code looks like this:

grouped = df.groupby(by=['First Name', 'Last Name'])

def val_func(x):
    if (x['Val1'] == 'Y').any():
    return 'Y'
    else:
    return 'N'

cumulative = grouped.apply(val_func)

This works except that cumulative has dtype object and I can only access Val1, that is, I cannot access First Name or Last Name (Although when I run print(cumulative), it does print everything).

If I try:

df_cumulative = pd.DataFrame(cumulative)

then, I just get the column with Y or N, but not the names.

How to fix this? Moreover, can I return two arguments? one for Val1 and one for Total? or would I have to run another apply for Total and append the column to the dataframe?

CodePudding user response:

Another way is to use groupby.agg where you use max to get "Y" if it exists (because Y>N) and count:

out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False)\
        .agg(Val1=('Val1', 'max'), Total=('Val1', 'count'))

Output:

  First Name Last Name Val1  Total
0     George   Clooney    Y      3
1     George   Freeman    N      2
2     Claire     Stark    Y      2

You can pass in a lambda that selects based whatever criteria you want. For example, the following aggregates "Val1" based on whether the number of "Y"s are greater than the number of "N"s (if there are more "Y"s select "Y" else "N"):

out = df.groupby(['First Name', 'Last Name'], sort=False, as_index=False)\
        .agg(Val1=('Val1', lambda x: 'Y' if x.eq('Y').sum() > x.eq('N').sum() else 'N'), 
             Total=('Val1', 'count'))

CodePudding user response:

Here is a way:

(df.sort_values('Val1')
 .groupby(['First Name','Last Name'])
 .agg(Val1 = ('Val1','last'),count = ('Val1','count'))
 .reset_index())

Output:

  First Name Last Name Val1  count
0     Claire     Stark    Y      2
1     George   Clooney    Y      3
2     George   Freeman    N      2
  • Related