Replicating one value among multiple rows in R

Time:09-22

I have a dataset:

client_id cohort year which_year frequency
    7      2006  2011     6         10
    7      2006  2012     7         11  
    7      2006  2013     8         17
    17     2006  2011     6         9
    17     2006  2012     7         4
    17     2006  2013     8         7
    19     2015  2015     1         2
    19     2015  2016     2         20
    19     2015  2017     3         30

I would like to create a variable "frequency_start" which takes the frequency at the minimum which_year for each client_id and repeats it across all rows of that client_id. Like this:

client_id cohort year which_year frequency  frequency_start
    7      2006  2011     6         10            10 
    7      2006  2012     7         11            10 
    7      2006  2013     8         17            10 
    17     2006  2011     6         9             9
    17     2006  2012     7         4             9 
    17     2006  2013     8         7             9
    19     2015  2015     1         2             2
    19     2015  2016     2         20            2 
    19     2015  2017     3         30            2

I've tried for loops with lapply and match with which.min but nothing worked properly. Do you have any idea what might help? I would be very, very grateful.

CodePudding user response:

You can use which.min to get the index of the minimum which_year within each client_id, then take the frequency value at that index.

library(dplyr)

df <- df %>%
  group_by(client_id) %>%
  # within each client, take the frequency at the smallest which_year
  mutate(frequency_start = frequency[which.min(which_year)]) %>%
  ungroup()

df

#  client_id cohort  year which_year frequency frequency_start
#      <int>  <int> <int>      <int>     <int>           <int>
#1         7   2006  2011          6        10              10
#2         7   2006  2012          7        11              10
#3         7   2006  2013          8        17              10
#4        17   2006  2011          6         9               9
#5        17   2006  2012          7         4               9
#6        17   2006  2013          8         7               9
#7        19   2015  2015          1         2               2
#8        19   2015  2016          2        20               2
#9        19   2015  2017          3        30               2
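To see what the grouped mutate is doing, it can help to work through a single client by hand. The sketch below uses base R only; the three rows are hypothetical and deliberately out of order, and the variable names pos and frequency_start are just for illustration.

```r
# Values for one client, deliberately not sorted by which_year (hypothetical rows)
which_year <- c(7L, 6L, 8L)
frequency  <- c(11L, 10L, 17L)

# which.min() returns the position of the first minimum
pos <- which.min(which_year)

# Take the frequency at that position and repeat it for every row of the client
frequency_start <- rep(frequency[pos], length(frequency))

print(pos)
print(frequency_start)
```

Inside mutate() the rep() step is not needed, because a length-one result is recycled across the rows of the group. Since which.min picks the position of the smallest value, the rows do not have to be sorted by which_year beforehand.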

Data

It is easier to help if you provide data in a reproducible format:

df <- structure(list(client_id = c(7L, 7L, 7L, 17L, 17L, 17L, 19L, 
19L, 19L), cohort = c(2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 
2015L, 2015L, 2015L), year = c(2011L, 2012L, 2013L, 2011L, 2012L, 
2013L, 2015L, 2016L, 2017L), which_year = c(6L, 7L, 8L, 6L, 7L, 
8L, 1L, 2L, 3L), frequency = c(10L, 11L, 17L, 9L, 4L, 7L, 2L, 
20L, 30L)), class = "data.frame", row.names = c(NA, -9L))

CodePudding user response:

Base R:

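# dat is the question's data frame (called df in the other answer)
dat$frequency_start <- dat$frequency[ave(
  seq_len(nrow(dat)), dat$client_id,
  # within each client, return the row index of the minimum which_year
  FUN = function(i) i[which.min(dat$which_year[i])]
)]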
dat
#   client_id cohort year which_year frequency frequency_start
# 1         7   2006 2011          6        10              10
# 2         7   2006 2012          7        11              10
# 3         7   2006 2013          8        17              10
# 4        17   2006 2011          6         9               9
# 5        17   2006 2012          7         4               9
# 6        17   2006 2013          8         7               9
# 7        19   2015 2015          1         2               2
# 8        19   2015 2016          2        20               2
# 9        19   2015 2017          3        30               2

Usually ave is given a data column as its first argument. Here, however, we need the frequency from the row where a different column is at its minimum, so two columns are involved, and I pass the row indices instead. ave builds its result by assigning the function's output back into a copy of its first argument, so the result's type is tied to that argument rather than only to what the inner function returns. With row indices going in and a row index coming out, the result is a plain integer vector that can be used to index the original frequency column. In this sample data frequency appears to be an integer, but for all I know it is near-integer floating-point in the real data, and I would rather not rely on ave to hand back the right type. Indexing frequency by the result of ave means frequency_start always has the same class/type as frequency.
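As a rough check of that reasoning, here is a small sketch with a hypothetical variant of the data in which frequency is stored as doubles and the rows of one client are out of order. The data values and the name idx are made up for illustration; only ave, seq_len, and which.min from base R are used.

```r
# Hypothetical data: frequency is double, client 7's rows are not sorted
dat <- data.frame(
  client_id  = c(7L, 7L, 17L, 17L),
  which_year = c(7L, 6L, 6L, 7L),
  frequency  = c(11, 10.5, 9.25, 4)
)

# For every row, the row index of the minimum which_year within its client
idx <- ave(
  seq_len(nrow(dat)), dat$client_id,
  FUN = function(i) i[which.min(dat$which_year[i])]
)

# Index the original column, so the result keeps the type of frequency
dat$frequency_start <- dat$frequency[idx]

print(idx)
print(dat)
```

Because ave only ever sees and returns row indices here, the values in frequency_start are copied straight from frequency and keep their fractional parts.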
