I have a dataframe:
> df = batch Code. time
> a 100. 2019-08-01 00:59:12.000
> a 120. 2019-08-01 00:59:32.000
> a 130. 2019-08-01 00:59:42.000
> a 120. 2019-08-01 00:59:52.000
> b 100. 2019-08-01 00:44:11.000
> b 140. 2019-08-02 00:14:11.000
> b 150. 2019-08-03 00:47:11.000
> c 150. 2019-09-01 00:44:11.000
> d 100. 2019-08-01 00:10:00.000
> d 100. 2019-08-01 00:10:05.000
> d 130. 2019-08-01 00:10:10.000
> d 130. 2019-08-01 00:10:20.000
For each batch, I want the number of seconds between the time of the first code 100 and the time of the last code 130. If a batch has no code 100 followed by a code 130 (that is, one of them is missing), the result should be NaN. So the output should be:
df2 = batch duration
a 30
b nan
c nan
d 20
What is the best way to do it?
CodePudding user response:
Use:
import pandas as pd
# convert values to datetimes
df['time'] = pd.to_datetime(df['time'])
# get the first 100 Code per batch
s1 = df[df['Code.'].eq(100)].drop_duplicates('batch').set_index('batch')['time']
# get the last 130 Code per batch
s2 = df[df['Code.'].eq(130)].drop_duplicates('batch', keep='last').set_index('batch')['time']
# subtract, convert the timedeltas to seconds, and add back batches with no match
df = (s2.sub(s1)
        .dt.total_seconds()
        .reindex(df['batch'].unique())
        .reset_index(name='duration'))
print(df)
batch duration
0 a 30.0
1 b NaN
2 c NaN
3 d 20.0
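To try this end to end, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and applies the same first-100 / last-130 idea. The `where` guard is an added assumption, not part of the snippet above: it treats a batch whose last 130 does not come after its first 100 as NaN, which is one reading of the question. The names `raw`, `first_100`, `last_130` and `result` are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question; the column is named 'Code.' as in the original frame.
raw = """batch,Code.,time
a,100,2019-08-01 00:59:12.000
a,120,2019-08-01 00:59:32.000
a,130,2019-08-01 00:59:42.000
a,120,2019-08-01 00:59:52.000
b,100,2019-08-01 00:44:11.000
b,140,2019-08-02 00:14:11.000
b,150,2019-08-03 00:47:11.000
c,150,2019-09-01 00:44:11.000
d,100,2019-08-01 00:10:00.000
d,100,2019-08-01 00:10:05.000
d,130,2019-08-01 00:10:10.000
d,130,2019-08-01 00:10:20.000
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=['time'])

# First 100 and last 130 per batch, by row position (rows are assumed sorted by time).
first_100 = df[df['Code.'].eq(100)].drop_duplicates('batch').set_index('batch')['time']
last_130 = df[df['Code.'].eq(130)].drop_duplicates('batch', keep='last').set_index('batch')['time']

delta = last_130.sub(first_100)
# Assumption: a 130 that is not strictly after the 100 counts as "missing".
delta = delta.where(delta > pd.Timedelta(0))

result = (delta.dt.total_seconds()
               .reindex(df['batch'].unique())
               .rename_axis('batch')
               .reset_index(name='duration'))
print(result)
```

On the sample data this gives 30.0 for a, 20.0 for d, and NaN for b and c.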
CodePudding user response:
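As an alternative:
import pandas as pd
# assumes df['time'] has already been converted with pd.to_datetime
batchs = pd.DataFrame(df['batch'].unique(), columns=['batch'])
# keep the first 100 and the last 130 of each batch (not of the whole frame)
final = pd.concat([
    df[df['Code.'] == 100].drop_duplicates(subset='batch', keep='first'),
    df[df['Code.'] == 130].drop_duplicates(subset='batch', keep='last'),
]).sort_values(['batch', 'Code.'])
# within each batch, subtract the 100 row's time from the 130 row that follows it
final['duration'] = (final.groupby('batch')['time'].shift(-1) - final['time']).dt.total_seconds()
final = final.drop_duplicates('batch', keep='first').drop(['time', 'Code.'], axis=1).merge(batchs, on='batch', how='right')
final
batch duration
0 a 30.0
1 b NaN
2 c NaN
3 d 20.0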
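One caveat: `drop_duplicates` with `keep='first'` / `keep='last'` picks rows by position, so it only finds the first and last codes if each batch is already sorted by time. If that is not guaranteed, a possible variant is to take the earliest 100 and the latest 130 by timestamp with `groupby`. This is only a sketch: the small frame below reuses rows from batches b and d of the question, shuffled on purpose, and the names `first_100`, `last_130` and `duration` are illustrative.

```python
import pandas as pd

# Rows from the question's batches b and d, deliberately out of time order.
df = pd.DataFrame({
    'batch': ['d', 'b', 'd', 'd', 'd'],
    'Code.': [130, 100, 100, 130, 100],
    'time': pd.to_datetime([
        '2019-08-01 00:10:20', '2019-08-01 00:44:11', '2019-08-01 00:10:05',
        '2019-08-01 00:10:10', '2019-08-01 00:10:00',
    ]),
})

# Earliest 100 and latest 130 per batch, judged by timestamp rather than row position.
first_100 = df[df['Code.'].eq(100)].groupby('batch')['time'].min()
last_130 = df[df['Code.'].eq(130)].groupby('batch')['time'].max()

# Batches missing either code come out as NaN after alignment and reindexing.
duration = (last_130.sub(first_100)
                    .dt.total_seconds()
                    .reindex(sorted(df['batch'].unique()))
                    .rename_axis('batch')
                    .reset_index(name='duration'))
print(duration)
```

Here batch d still comes out as 20.0 seconds despite the shuffled rows, and batch b (no 130) is NaN.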