How to weigh values based on a variable in R?

Time:10-05

I have data like this:

df <- structure(list(USER = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5), 
    timestamp = structure(c(1614179957.06, 1614181158, 1614181757, 
    1614181938, 1614185926, 1614185987, 1614196768.466, 1614205951.597, 
    1614206076, 1614210969.716, 1614210971.501, 1614210977.449
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), source = c("A", 
    "B", "A", "B", "B", "V", "C", "A", "A", "B", "H", "A"), event = c(NA, 
    NA, NA, "y", NA, "y", "y", NA, "y", NA, NA,"y")), row.names = c(NA, 
-12L), class = c("data.table", "data.frame"), sorted = c("USER", 
"timestamp"))

Looks like this...

USER    timestamp   source  event
1   2/24/21 15:19   A   
1   2/24/21 15:39   B   
1   2/24/21 15:49   A   
1   2/24/21 15:52   B       y
2   2/24/21 16:58   B   
2   2/24/21 16:59   V       y
3   2/24/21 19:59   C       y
4   2/24/21 22:32   A   
4   2/24/21 22:34   A       y
5   2/24/21 23:56   B   
5   2/24/21 23:57   H       
5   2/24/21 23:58   A       y

I'm trying to assign a weight to each of the sources. I would like to create a new column using dplyr that weights each row based on how far it is from the event (y).

Specifically, I want to do this in two ways:

  1. The closer the row is to the event row, the more it's worth, effectively a time decay. The row where event = y is the one that occurred closest to the event, so it gets the largest weight.

So for USER = 1, you would see:

USER    timestamp   source  event  weight
1   2/24/21 15:19   A              .1
1   2/24/21 15:39   B              .2
1   2/24/21 15:49   A              .3
1   2/24/21 15:52   B       y      .4

  2. The first and last rows weigh the most, everything in between gets a lower weight (U-shaped).

So for USER = 1, you would see:

USER    timestamp   source  event  weight
1   2/24/21 15:19   A              .4
1   2/24/21 15:39   B              .1
1   2/24/21 15:49   A              .1
1   2/24/21 15:52   B       y      .4

CodePudding user response:

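The code below creates weight columns for both criteria. weight1 is each row's position within its USER group divided by the sum of the positions, so later rows weigh more and each group sums to 1.
For the second criterion it uses an auxiliary function f to keep the code readable.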

library(dplyr)

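# U-shaped weights for a group of n rows: the first and last rows get
# weight n, every row in between gets 1, then all are scaled to sum to 1.
f <- function(n){
  if(n > 1L){
    m <- c(n, rep(1L, n - 2L), n)
    m/sum(m)
  } else 1L
}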
df %>%
  group_by(USER) %>%
  mutate(weight1 = seq_along(event)/sum(seq_along(event)),
         weight2 = f(n()))
## A tibble: 12 x 6
## Groups:   USER [5]
#    USER timestamp           source event weight1 weight2
#   <dbl> <dttm>              <chr>  <chr>   <dbl>   <dbl>
# 1     1 2021-02-24 15:19:17 A      NA      0.1     0.4  
# 2     1 2021-02-24 15:39:18 B      NA      0.2     0.1  
# 3     1 2021-02-24 15:49:17 A      NA      0.3     0.1  
# 4     1 2021-02-24 15:52:18 B      y       0.4     0.4  
# 5     2 2021-02-24 16:58:46 B      NA      0.333   0.5  
# 6     2 2021-02-24 16:59:47 V      y       0.667   0.5  
# 7     3 2021-02-24 19:59:28 C      y       1       1    
# 8     4 2021-02-24 22:32:31 A      NA      0.333   0.5  
# 9     4 2021-02-24 22:34:36 A      y       0.667   0.5  
#10     5 2021-02-24 23:56:09 B      NA      0.167   0.429
#11     5 2021-02-24 23:56:11 H      NA      0.333   0.143
#12     5 2021-02-24 23:56:17 A      y       0.5     0.429
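
The weight1 column above depends only on row position, not on how many seconds separate a row from the event. If an actual time decay is wanted, one possible sketch in base R is shown below. It assumes the event row is the last row of each group (true for the sample data) and uses an exponential decay; the function name decay_weights and the half_life_secs parameter are hypothetical, not part of the answer above.

```r
# Sketch: weights that decay with elapsed time before the last row of a group.
# `decay_weights` and `half_life_secs` are hypothetical names for illustration.
decay_weights <- function(ts, half_life_secs = 600) {
  # Seconds between each row and the latest row in the group.
  gap <- as.numeric(max(ts)) - as.numeric(ts)
  # Halve the raw weight for every `half_life_secs` of distance.
  w <- 0.5^(gap / half_life_secs)
  # Scale so the group's weights sum to 1.
  w / sum(w)
}

# USER 1's timestamps from the question, rounded to whole seconds.
ts <- as.POSIXct(c(1614179957, 1614181158, 1614181757, 1614181938),
                 origin = "1970-01-01", tz = "UTC")
round(decay_weights(ts), 3)
```

Inside the pipeline this could be called after group_by(USER) as mutate(weight_time = decay_weights(timestamp)). The half-life would need tuning to the time scale of the real data.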

CodePudding user response:

We may also use arrange to order the rows by timestamp within each USER, and then create the column by dividing row_number() by 10, with a logical condition in case_when so that only rows up to the first event row (event = 'y') receive a weight:

library(dplyr)
df %>% 
    arrange(USER, timestamp, event %in% 'y') %>%
    group_by(USER) %>% 
    mutate(weight = case_when(row_number() <= match('y', event) ~ row_number()/10))
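
The fixed divisor of 10 matches the expected output for USER = 1 because that group has four rows (1 + 2 + 3 + 4 = 10). A small base R check, using ave() as a stand-in for the grouped row_number() and a toy vector with the group sizes of USER 1 and USER 5, suggests how a per-group divisor would behave instead. The variable names here are illustrative only.

```r
# Toy USER column: 4 rows for USER 1 and 3 rows for USER 5, as in the question.
user <- c(1, 1, 1, 1, 5, 5, 5)

# Position of each row within its USER group (base R stand-in for row_number()).
pos <- ave(seq_along(user), user, FUN = seq_along)

# Fixed divisor, as in the answer above.
w_fixed <- pos / 10

# Per-group divisor: divide by the sum of positions in the same group.
w_norm <- pos / ave(pos, user, FUN = sum)

tapply(w_fixed, user, sum)  # USER 1 sums to 1, USER 5 sums to 0.6
tapply(w_norm, user, sum)   # both groups sum to 1
```

In the dplyr version, the equivalent change would be to divide by sum(row_number()) rather than 10, which is what weight1 in the first answer does.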