Group by in Pandas based on a condition

I have a dataframe:

| phone_number | call_date | answered | attempt |
| 123          | 13thJune  | 1        | 1       |
| 234          | 15thJune  | 0        | 1       |
| 234          | 15thJune  | 0        | 2       |

I want to group by phone_number and take the max call_date of the answered calls. If the call is not answered (answered is 0), the max date should be blank. In other words,

df.groupby(['phone_number'])['call_date'].max().reset_index()

should apply only when answered is > 0; otherwise the groupby should give me a blank.

How do I achieve this?

Expected df:

phone_number | max_call_date
123          | 13thJune
234          | NaN

CodePudding user response：

Use:

df = df.groupby('phone_number').apply(lambda x: x[x['answered']!=0]['call_date'].max()).reset_index().rename(columns={0: 'max_call_date'})
print(df)

Output:

   phone_number max_call_date
0           123      13thJune
1           234           NaN

CodePudding user response：

The first idea is to keep only the rows where answered is not 0, aggregate with max, and then add back the phone numbers that were filtered out (as NaN) with Series.reindex:

df1 = (df[df['answered'].ne(0)]
          .groupby(['phone_number'])['call_date']
          .max()
          .reindex(df['phone_number'].unique())
          .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN

Or replace call_date with missing values where answered is 0, and then aggregate with max:

df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
         .groupby(['phone_number'])['call_date'].max()
         .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN

The last idea applies if you need NaN whenever at least one call for the number has answered equal to 0, that is, when the minimum of answered in the group is 0:

df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})

df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
   phone_number max_call_date
0           123      13thJune
1           234           NaN
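
Note that on the sample data all three ideas give the same output; they only differ when a number has both answered and unanswered calls. A minimal sketch of that case, using made-up data (the 2022 dates, the extra number 345, and the names latest_answered and all_answered are my assumptions, not from the question):

```python
import pandas as pd

# Hypothetical data: 234 has one unanswered and one answered call,
# 345 was never answered.
df = pd.DataFrame({
    "phone_number": [123, 234, 234, 345],
    "call_date": pd.to_datetime(
        ["2022-06-13", "2022-06-15", "2022-06-16", "2022-06-17"]
    ),
    "answered": [1, 0, 1, 0],
})

# Latest answered call: blank out unanswered rows, then take the max per number.
latest_answered = (
    df["call_date"].where(df["answered"].ne(0))
      .groupby(df["phone_number"])
      .max()
)

# Blank unless every call was answered: mask groups whose minimum answered is 0.
agg = df.groupby("phone_number").agg(
    call_date=("call_date", "max"),
    answered=("answered", "min"),
)
all_answered = agg["call_date"].mask(agg["answered"].eq(0))

print(latest_answered)
print(all_answered)
```

For 234, latest_answered keeps the date of the answered call, while all_answered is NaT because one of its calls was unanswered. Which one is wanted depends on how "blank" is meant in the question.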

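EDIT: To get the correct maximum date, convert the column to datetimes first, because strings are compared lexicographically. No year is given, so the parsed dates fall in 1900:

# match the ordinal suffix only right after a digit, so the "st" in "August" is left alone
df['call_date'] = pd.to_datetime(df['call_date'].str.replace(r'(?<=\d)(?:st|nd|rd|th)', ' ', regex=True),
                                 format='%d %B')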


df1 = (df[df['answered'].ne(0)]
          .groupby(['phone_number'])['call_date']
          .max()
          .reindex(df['phone_number'].unique())
          .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT

df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
         .groupby(['phone_number'])['call_date'].max()
         .reset_index(name='max_call_date'))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT

df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})

df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
   phone_number max_call_date
0           123    1900-06-13
1           234           NaT
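
To see why the conversion to datetimes matters, here is a small sketch comparing the maximum of the raw strings with the maximum of the parsed dates. The three sample strings and the names raw, cleaned and dates are made up for illustration, and English month names are assumed:

```python
import pandas as pd

# Hypothetical values in the same format as the question's call_date column.
raw = pd.Series(["9thJune", "13thJune", "1stAugust"])

# Strings compare character by character, so "9thJune" is the "largest".
print(raw.max())

# Replace the ordinal suffix with a space, but only when it follows a digit,
# so the "st" inside "August" is not touched.
cleaned = raw.str.replace(r"(?<=\d)(?:st|nd|rd|th)", " ", regex=True)

# No year in the data, so pandas falls back to 1900.
dates = pd.to_datetime(cleaned, format="%d %B")
print(dates.max())
```

The string maximum picks the June date, while the datetime maximum picks the August one.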

CodePudding user response：

You could use:

(df.groupby('phone_number', as_index=False)
   .agg({'call_date':'max', 'answered':'min'})
   # test the aggregated frame d, not the original df, whose index does not match
   .assign(call_date=lambda d: d['call_date'].where(d['answered'].ne(0)))
   .drop('answered', axis=1)
 )

Output:

   phone_number call_date
0           123  13thJune
1           234       NaN