I have a dataframe
|phone_number|call_date|answered| attempt|
|123 | 13thJune| 1 | 1 |
|234 | 15thJune| 0 | 1 |
|234 | 15thJune| 0 | 2 |
I want to group by phone_number and get the latest call_date among the answered calls. If none of a number's calls were answered (answered is 0), the max date should be blank. In other words, I want
df.groupby(['phone_number'])['call_date'].max().reset_index()
to apply only when answered is > 0; otherwise the groupby should give me a blank.
How do I achieve this?
Expected df:
|phone_number|max_call_date|
|123 | 13thJune |
|234 | NaN |
CodePudding user response:
Use:
df = df.groupby('phone_number').apply(lambda x: x[x['answered']!=0]['call_date'].max()).reset_index().rename(columns={0: 'max_call_date'})
print(df)
Output:
phone_number max_call_date
0 123 13thJune
1 234 NaN
CodePudding user response:
The first idea is to keep only the rows where answered is not 0, aggregate with max, and then add back the filtered-out phone_number values as NaN with Series.reindex:
df1 = (df[df['answered'].ne(0)]
.groupby(['phone_number'])['call_date']
.max()
.reindex(df['phone_number'].unique())
.reset_index(name='max_call_date'))
print (df1)
phone_number max_call_date
0 123 13thJune
1 234 NaN
Or replace call_date with missing values where answered is 0, and then aggregate with max:
df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
.groupby(['phone_number'])['call_date'].max()
.reset_index(name='max_call_date'))
print (df1)
phone_number max_call_date
0 123 13thJune
1 234 NaN
The last idea applies if you need NaN whenever at least one call in the group has answered=0, i.e. whenever the minimum of the answered column is 0:
df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})
df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
phone_number max_call_date
0 123 13thJune
1 234 NaN
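The sample data cannot show how the last idea differs from the first two, because no phone number there has both answered and unanswered calls. Here is a minimal sketch of the difference, assuming a hypothetical extra number 345 and a made-up year (2022) for the dates; neither comes from the question.

```python
import pandas as pd

# Hypothetical data: 345 has one unanswered and one answered call.
# The year 2022 is an assumption made only for illustration.
df = pd.DataFrame({
    "phone_number": [123, 234, 234, 345, 345],
    "call_date": pd.to_datetime(
        ["2022-06-13", "2022-06-15", "2022-06-15", "2022-06-16", "2022-06-17"]
    ),
    "answered": [1, 0, 0, 0, 1],
    "attempt": [1, 1, 2, 1, 2],
})

# Approach A: ignore unanswered calls, then take the max per number.
latest_answered = (
    df.assign(call_date=df["call_date"].mask(df["answered"].eq(0)))
      .groupby("phone_number")["call_date"].max()
      .reset_index(name="max_call_date")
)

# Approach B: blank the result if any call in the group went unanswered.
agg = df.groupby("phone_number", as_index=False).agg(
    call_date=("call_date", "max"), answered=("answered", "min")
)
all_answered = agg.assign(
    max_call_date=agg["call_date"].mask(agg["answered"].eq(0))
)[["phone_number", "max_call_date"]]

print(latest_answered)
print(all_answered)
```

For 345, approach A keeps the date of the answered call, while approach B returns a missing value because one of its calls was unanswered. Which one is right depends on what "blank" should mean for partly answered numbers.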
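EDIT: To get the correct maximum from date strings, the column must first be converted to datetimes. The ordinal suffix is stripped only when it follows a digit, so that month names such as August are left intact. Because the strings carry no year, the parsed dates default to 1900:
df['call_date'] = pd.to_datetime(df['call_date'].str.replace(r'(?<=\d)(?:st|nd|rd|th)',' ',regex=True),
format='%d %B')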
df1 = (df[df['answered'].ne(0)]
.groupby(['phone_number'])['call_date']
.max()
.reindex(df['phone_number'].unique())
.reset_index(name='max_call_date'))
print (df1)
phone_number max_call_date
0 123 1900-06-13
1 234 NaT
df1 = (df.assign(call_date = df['call_date'].mask(df['answered'].eq(0)))
.groupby(['phone_number'])['call_date'].max()
.reset_index(name='max_call_date'))
print (df1)
phone_number max_call_date
0 123 1900-06-13
1 234 NaT
df1 = df.groupby('phone_number', as_index=False).agg({'call_date':'max', 'answered':'min'})
df1['max_call_date'] = df1.pop('call_date').mask(df1.pop('answered').eq(0))
print (df1)
phone_number max_call_date
0 123 1900-06-13
1 234 NaT
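A small sketch of the conversion step on its own may help, since two details are easy to miss. Replacing every 'st', 'nd', 'rd' or 'th' also hits month names ("August" contains "st"), and the strings have no year, so the dates land in 1900. The values "1stAugust" and "2ndJuly" and the year 2022 are assumptions added for illustration; they are not in the question.

```python
import pandas as pd

# The first two values are from the question; the last two are hypothetical.
raw = pd.Series(["13thJune", "15thJune", "1stAugust", "2ndJuly"])

# Remove the ordinal suffix only when it directly follows a digit,
# so the "st" inside "August" is not touched.
cleaned = raw.str.replace(r"(?<=\d)(?:st|nd|rd|th)", " ", regex=True)

# Assumed year, appended so the parsed dates are not set to 1900.
ASSUMED_YEAR = 2022
parsed = pd.to_datetime(cleaned + f" {ASSUMED_YEAR}", format="%d %B %Y")

print(cleaned.tolist())
print(parsed.max())
```

Once the column holds real datetimes, max compares them chronologically rather than alphabetically, so any of the groupby variants above can be applied unchanged.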
CodePudding user response:
You could use:
(df.groupby('phone_number', as_index=False)
.agg({'call_date':'max', 'answered':'min'})
.assign(call_date=lambda d: d['call_date'].where(d['answered'].eq(1)))
.drop('answered', axis=1)
)
Output:
phone_number call_date
0 123 13thJune
1 234 NaN