How to weigh values based on a variable in R?

I have data like this:

df <- structure(list(USER = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5), 
    timestamp = structure(c(1614179957.06, 1614181158, 1614181757, 
    1614181938, 1614185926, 1614185987, 1614196768.466, 1614205951.597, 
    1614206076, 1614210969.716, 1614210971.501, 1614210977.449
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), source = c("A", 
    "B", "A", "B", "B", "V", "C", "A", "A", "B", "H", "A"), event = c(NA, 
    NA, NA, "y", NA, "y", "y", NA, "y", NA, NA,"y")), row.names = c(NA, 
-12L), class = c("data.table", "data.frame"), sorted = c("USER", 
"timestamp"))

It looks like this:

USER    timestamp   source  event
1   2/24/21 15:19   A   
1   2/24/21 15:39   B   
1   2/24/21 15:49   A   
1   2/24/21 15:52   B       y
2   2/24/21 16:58   B   
2   2/24/21 16:59   V       y
3   2/24/21 19:59   C       y
4   2/24/21 22:32   A   
4   2/24/21 22:34   A       y
5   2/24/21 23:56   B   
5   2/24/21 23:57   H       
5   2/24/21 23:58   A       y

I'm trying to assign a weight to each of the sources. I would like to create a new column using dplyr that weighs each row based on how far it is from the event (y).

Specifically, I want to do this in two ways:

The closer a row is to the event row, the more it is worth, effectively a time decay. The row where event = y is the one that occurred closest to the event, so it gets the highest weight.

So for USER = 1, the result would be:

USER    timestamp   source  event  weight
1   2/24/21 15:19   A              .1
1   2/24/21 15:39   B              .2
1   2/24/21 15:49   A              .3
1   2/24/21 15:52   B       y      .4

The first and last rows weigh the most and everything in between gets a lower weight (U-shaped).

So for USER = 1, the result would be:

USER    timestamp   source  event  weight
1   2/24/21 15:19   A              .4
1   2/24/21 15:39   B              .1
1   2/24/21 15:49   A              .1
1   2/24/21 15:52   B       y      .4

CodePudding user response：

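The code below creates weight columns for both criteria. Both are based on the row's position within each USER group and are scaled so that the weights in a group sum to 1.
For the second criterion it uses an auxiliary function f to keep the code readable.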

library(dplyr)

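# U-shaped weights for a group of n rows: the first and last rows get
# weight n, the rows in between get 1, and the result is scaled to sum to 1.
f <- function(n){
  if(n > 1L){
    m <- c(n, rep(1L, n - 2L), n)
    m/sum(m)
  } else 1  # a single-row group gets the full weight
}

df %>%
  group_by(USER) %>%
  # weight1: position within the group divided by the sum of positions,
  # so later rows (closer to the event) weigh more
  mutate(weight1 = seq_along(event)/sum(seq_along(event)),
         weight2 = f(n()))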
## A tibble: 12 x 6
## Groups:   USER [5]
#    USER timestamp           source event weight1 weight2
#   <dbl> <dttm>              <chr>  <chr>   <dbl>   <dbl>
# 1     1 2021-02-24 15:19:17 A      NA      0.1     0.4  
# 2     1 2021-02-24 15:39:18 B      NA      0.2     0.1  
# 3     1 2021-02-24 15:49:17 A      NA      0.3     0.1  
# 4     1 2021-02-24 15:52:18 B      y       0.4     0.4  
# 5     2 2021-02-24 16:58:46 B      NA      0.333   0.5  
# 6     2 2021-02-24 16:59:47 V      y       0.667   0.5  
# 7     3 2021-02-24 19:59:28 C      y       1       1    
# 8     4 2021-02-24 22:32:31 A      NA      0.333   0.5  
# 9     4 2021-02-24 22:34:36 A      y       0.667   0.5  
#10     5 2021-02-24 23:56:09 B      NA      0.167   0.429
#11     5 2021-02-24 23:56:11 H      NA      0.333   0.143
#12     5 2021-02-24 23:56:17 A      y       0.5     0.429
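
To see what the helper f does on its own, here is a small base R check. It repeats the definition of f from above so that it runs without dplyr; the group sizes tried are arbitrary.

```r
# Same helper as in the answer above, repeated so this runs on its own.
f <- function(n){
  if(n > 1L){
    m <- c(n, rep(1L, n - 2L), n)
    m/sum(m)
  } else 1
}

f(4)
# [1] 0.4 0.1 0.1 0.4

# For every group size tried here, the weights add up to 1.
totals <- sapply(1:6, function(n) sum(f(n)))
all.equal(totals, rep(1, 6))
# [1] TRUE
```

For a group of 4 rows the unscaled weights are 4, 1, 1, 4, which sum to 10, hence the 0.4 / 0.1 / 0.1 / 0.4 pattern asked for in the question.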

CodePudding user response：

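We may also use arrange to sort the rows by timestamp and then create the column by dividing row_number() by 10, with a logical condition in case_when so that only rows up to the event row get a weight. Dividing by 10 makes the weights sum to 1 only for a group of four rows such as USER 1, whose row numbers sum to 10.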

library(dplyr)
df %>% 
    arrange(USER, timestamp, event %in% 'y') %>%
    group_by(USER) %>% 
    mutate(weight = case_when(row_number() <= match('y', event) ~ row_number()/10))
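
If the weights should sum to 1 in every group rather than only in four-row groups, one possible variant divides by the group's own sum of row numbers instead of the fixed 10. This is a sketch, not part of the answer above: the small data frame and the column names pos and last are made up for illustration, and the rows are assumed to be already ordered by timestamp within each USER.

```r
library(dplyr)

# Illustrative data only: two users, each ending in an event row.
toy <- data.frame(
  USER  = c(1, 1, 1, 1, 2, 2),
  event = c(NA, NA, NA, "y", NA, "y")
)

out <- toy %>%
  group_by(USER) %>%
  mutate(
    pos    = row_number(),      # position within the group
    last   = match("y", event), # position of the first event row
    # rows up to the event get pos / (sum of those positions); later rows get NA
    weight = if_else(pos <= last, pos / sum(pos[pos <= last]), NA_real_)
  ) %>%
  ungroup()

out$weight
```

For USER 1 this gives 0.1, 0.2, 0.3, 0.4 as in the question, and for USER 2 it gives 1/3 and 2/3, matching weight1 from the first answer.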