I have a dataset:
client_id cohort year which_year frequency
7 2006 2011 6 10
7 2006 2012 7 11
7 2006 2013 8 17
17 2006 2011 6 9
17 2006 2012 7 4
17 2006 2013 8 7
19 2015 2015 1 2
19 2015 2016 2 20
19 2015 2017 3 30
I would like to create a variable "frequency_start" which takes the frequency at the minimum which_year for a client_id and replicates it among all rows where client_id = i. Like this:
client_id cohort year which_year frequency frequency_start
7 2006 2011 6 10 10
7 2006 2012 7 11 10
7 2006 2013 8 17 10
17 2006 2011 6 9 9
17 2006 2012 7 4 9
17 2006 2013 8 7 9
19 2015 2015 1 2 2
19 2015 2016 2 20 2
19 2015 2017 3 30 2
I've tried for loops with lapply and match with which.min but nothing worked properly. Do you have any idea what might help? I would be very, very grateful.
CodePudding user response:
You may use which.min to get the index of the minimum which_year and take the corresponding frequency value for each client_id.
library(dplyr)
df <- df %>%
group_by(client_id) %>%
mutate(frequency_start = frequency[which.min(which_year)]) %>%
ungroup()
df
# client_id cohort year which_year frequency frequency_start
# <int> <int> <int> <int> <int> <int>
#1 7 2006 2011 6 10 10
#2 7 2006 2012 7 11 10
#3 7 2006 2013 8 17 10
#4 17 2006 2011 6 9 9
#5 17 2006 2012 7 4 9
#6 17 2006 2013 8 7 9
#7 19 2015 2015 1 2 2
#8 19 2015 2016 2 20 2
#9 19 2015 2017 3 30 2
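To see what the grouped expression does, here is a minimal base R sketch for a single group (client_id 7, values copied from the question). The variable names pos and start are just illustrative, and rep() only mimics the way mutate recycles a length-one result across the rows of a group.

```r
# Values for client_id 7, copied from the sample data in the question
which_year <- c(6L, 7L, 8L)
frequency  <- c(10L, 11L, 17L)

# which.min() returns the position of the first minimum, not the minimum itself
pos <- which.min(which_year)

# Indexing frequency by that position yields a single value for the group
start <- frequency[pos]

# mutate() recycles a length-one result over the group; rep() imitates that here
frequency_start <- rep(start, length(frequency))
print(frequency_start)
```

Note that if several rows of a client tie for the smallest which_year, which.min picks the first of them.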
data
It is easier to help if you provide data in a reproducible format, for example the output of dput(df):
df <- structure(list(client_id = c(7L, 7L, 7L, 17L, 17L, 17L, 19L,
19L, 19L), cohort = c(2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2015L, 2015L, 2015L), year = c(2011L, 2012L, 2013L, 2011L, 2012L,
2013L, 2015L, 2016L, 2017L), which_year = c(6L, 7L, 8L, 6L, 7L,
8L, 1L, 2L, 3L), frequency = c(10L, 11L, 17L, 9L, 4L, 7L, 2L,
20L, 30L)), class = "data.frame", row.names = c(NA, -9L))
CodePudding user response:
Base R:
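# dat is the data frame from the question (called df in the other answer)
# year is used for the minimum; which_year gives the same result for this data
dat$frequency_start <- dat$frequency[ave(
seq_len(nrow(dat)), dat$client_id,
FUN = function(i) i[which.min(dat$year[i])]
)]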
dat
# client_id cohort year which_year frequency frequency_start
# 1 7 2006 2011 6 10 10
# 2 7 2006 2012 7 11 10
# 3 7 2006 2013 8 17 10
# 4 17 2006 2011 6 9 9
# 5 17 2006 2012 7 4 9
# 6 17 2006 2013 8 7 9
# 7 19 2015 2015 1 2 2
# 8 19 2015 2016 2 20 2
# 9 19 2015 2017 3 30 2
Usually ave is given a data column as its first argument. However, since we want the frequency at the minimum year (two columns are involved), I pass the row indices instead. ave assigns its results back into a copy of its first argument, so its return type is tied to that argument rather than determined only by the inner function. The anonymous function therefore returns an integer row index, which is then used to index the original frequency column. With this sample data frequency appears to be an integer, but for all I know it is near-integer floating-point in the real data, and trusting ave to return the values directly could break that. By indexing on the return from ave, we always store the correct class/type in frequency_start.
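As a rough check of the reasoning above, here is a self-contained sketch using the question's data with two changes that are my own assumptions for illustration: the rows are shuffled, and frequency is converted to double to stand in for the "near-integer floating-point" case. It uses which_year, as the question asks, and the helper name idx is arbitrary.

```r
# Sample data from the question (cohort and year omitted for brevity)
dat <- data.frame(
  client_id  = c(7L, 7L, 7L, 17L, 17L, 17L, 19L, 19L, 19L),
  which_year = c(6L, 7L, 8L, 6L, 7L, 8L, 1L, 2L, 3L),
  frequency  = c(10L, 11L, 17L, 9L, 4L, 7L, 2L, 20L, 30L)
)

# Assumption for illustration: rows arrive unsorted and frequency is a double
dat <- dat[c(3, 5, 9, 1, 8, 4, 2, 7, 6), ]
rownames(dat) <- NULL
dat$frequency <- as.numeric(dat$frequency)

# For each client, ave() fills in the row index of the smallest which_year
idx <- ave(
  seq_len(nrow(dat)), dat$client_id,
  FUN = function(i) i[which.min(dat$which_year[i])]
)

# Indexing frequency by those row numbers keeps frequency's own type
dat$frequency_start <- dat$frequency[idx]
print(dat)
```

Because only row numbers pass through ave, the approach does not depend on the rows being sorted, and frequency_start has whatever type frequency has.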