I have a data frame as follows and would like to create a time spread variable based on max spread of events.
A<- c('1244', '1244', '1244', '1245', '1245', '1245', '1245', '1245', '1245', '1245')
sequence<- c(1,1,0, 1,1,0,0,1,1,1)
# 1= lived locally and 0 lived internationally
date<- c('19/Oct/12', '19/Oct/12', '20/Oct/12', '19/Oct/11', '19/Oct/11', '22/Nov/12',
'24/Nov/12', '29/Nov/12','2/Dec/12', '29/Dec/12')
df<- data.frame(A,sequence, date)
I would like to calculate the average time spread of these people in locations both locally and internationally. For example if we see the person 1244 she/he has sequence 110 (i.e., they lived in two spells in locally and 1 spell internationally and if we want to calculate the time spread the avg_local_timespread= total time in all spells/count of events in spells which translates to 0 day/2(which is difference between the last date- start date (so, 0 days= 19/Oct/12-19/Oct/12) and avg_international_timespread= 0/1=0 (20/Oct/12-20/Oct/12)
And for person 1245 it is 1100111 so they have lived locally in two spells, the first spell is 11 and second one is 111. As I want to get the average time spread locally in this case it is the time difference between start date and final date for first spell(11) last spell(111)/total spells (5). Hence avg_local_timespread= 6 days (30 0)/5 and avg_international_timespread= 1 day (2/2)('24/Nov/12'-'22/Nov/12')
I am expecting an output as follows:
A avg_local_timespread avg_international_timespread total_local_timespread
1244 0 0 0
1245 6 1 30
total_international_timespread
1244 0
1245 2
CodePudding user response:
I cannot replicate your desired output becasue i do not understand the calculation you provide in your question. But I believe this will get you started.
library(data.table)
library(lubridate)
# convert to data.table format
setDT(df)
# convert dates to date-format
df[, date := lubridate::dmy(date, locale = "English_United States")]
# create sequence-groups by A
df[, seq_group := rleid(sequence == 1), by = .(A)][]
# summarise
ans <- df[, .(start = min(date), end = max(date)), by = .(A, seq_group, sequence)]
# add diration of spell
ans[, duration := end - start]
# A seq_group sequence start end duration
# 1: 1244 1 1 2012-10-19 2012-10-19 0 days
# 2: 1244 2 0 2012-10-20 2012-10-20 0 days
# 3: 1245 1 1 2011-10-19 2011-10-19 0 days
# 4: 1245 2 0 2012-11-22 2012-11-24 2 days
# 5: 1245 3 1 2012-11-29 2012-12-29 30 days