I have data like this:
df <- structure(list(USER = c(1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5),
timestamp = structure(c(1614179957.06, 1614181158, 1614181757,
1614181938, 1614185926, 1614185987, 1614196768.466, 1614205951.597,
1614206076, 1614210969.716, 1614210971.501, 1614210977.449
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), source = c("A",
"B", "A", "B", "B", "V", "C", "A", "A", "B", "H", "A"), event = c(NA,
NA, NA, "y", NA, "y", "y", NA, "y", NA, NA,"y")), row.names = c(NA,
-12L), class = c("data.table", "data.frame"), sorted = c("USER",
"timestamp"))
Looks like this...
USER  timestamp      source  event
1     2/24/21 15:19  A
1     2/24/21 15:39  B
1     2/24/21 15:49  A
1     2/24/21 15:52  B       y
2     2/24/21 16:58  B
2     2/24/21 16:59  V       y
3     2/24/21 19:59  C       y
4     2/24/21 22:32  A
4     2/24/21 22:34  A       y
5     2/24/21 23:56  B
5     2/24/21 23:57  H
5     2/24/21 23:58  A       y
I'm basically trying to assign a weight to each of the sources. I would like to create a new column using dplyr that weights each row based on how far it is from the event (y).
Specifically, I want to see this two ways:
- The closer the row is to the event row, the more it's worth, effectively a time decay. The row where event = y is the row that occurred closest to the event.
So in the case of USER = 1, you see:
USER  timestamp      source  event  weight
1     2/24/21 15:19  A              .1
1     2/24/21 15:39  B              .2
1     2/24/21 15:49  A              .3
1     2/24/21 15:52  B       y      .4
- The first and last rows weigh the most; everything in between gets a lower weight (U-shaped).
So in the case of USER = 1, you see:
USER  timestamp      source  event  weight
1     2/24/21 15:19  A              .4
1     2/24/21 15:39  B              .1
1     2/24/21 15:49  A              .1
1     2/24/21 15:52  B       y      .4
CodePudding user response:
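The code below creates weight columns for both criteria.
weight1 is the row's position within its USER divided by the sum of the positions, so it depends on row order rather than on the actual time gaps; the rows must already be sorted by timestamp within USER, as they are here.
For the second way it uses an auxiliary function f to make the code more readable.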
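library(dplyr)

# U-shaped weights for a group of n rows: the first and last rows get n,
# the rows in between get 1, and the result is scaled to sum to 1.
f <- function(n){
  if(n > 1L){
    m <- c(n, rep(1L, n - 2L), n)
    m/sum(m)
  } else 1L
}

df %>%
  group_by(USER) %>%
  # weight1: row position divided by the sum of positions within USER
  mutate(weight1 = seq_along(event)/sum(seq_along(event)),
         weight2 = f(n()))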
## A tibble: 12 x 6
## Groups: USER [5]
# USER timestamp source event weight1 weight2
# <dbl> <dttm> <chr> <chr> <dbl> <dbl>
# 1 1 2021-02-24 15:19:17 A NA 0.1 0.4
# 2 1 2021-02-24 15:39:18 B NA 0.2 0.1
# 3 1 2021-02-24 15:49:17 A NA 0.3 0.1
# 4 1 2021-02-24 15:52:18 B y 0.4 0.4
# 5 2 2021-02-24 16:58:46 B NA 0.333 0.5
# 6 2 2021-02-24 16:59:47 V y 0.667 0.5
# 7 3 2021-02-24 19:59:28 C y 1 1
# 8 4 2021-02-24 22:32:31 A NA 0.333 0.5
# 9 4 2021-02-24 22:34:36 A y 0.667 0.5
#10 5 2021-02-24 23:56:09 B NA 0.167 0.429
#11 5 2021-02-24 23:56:11 H NA 0.333 0.143
#12 5 2021-02-24 23:56:17 A y 0.5 0.429
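If the weight should depend on the elapsed time rather than on row order, one possible variation is an exponential decay on the minutes between each row and the event row. This is only a sketch under stated assumptions: the half-life of 10 minutes, the names toy, half_life_min, mins_from_event, raw and weight_time are made up for illustration, and each USER is assumed to have exactly one row with event = y.

```r
library(dplyr)

# Stand-in for USER 1 from the question (timestamps rounded to whole seconds)
toy <- data.frame(
  USER = c(1, 1, 1, 1),
  timestamp = as.POSIXct(c("2021-02-24 15:19:17", "2021-02-24 15:39:18",
                           "2021-02-24 15:49:17", "2021-02-24 15:52:18"),
                         tz = "UTC"),
  event = c(NA, NA, NA, "y")
)

# Hypothetical choice: the raw weight halves for every 10 minutes of distance
half_life_min <- 10

decayed <- toy %>%
  group_by(USER) %>%
  mutate(
    # absolute distance in minutes from the (first) event row of the group
    mins_from_event = abs(as.numeric(
      difftime(timestamp[match("y", event)], timestamp, units = "mins"))),
    raw = 0.5^(mins_from_event / half_life_min),
    # normalise so the weights sum to 1 within each USER
    weight_time = raw / sum(raw)
  ) %>%
  ungroup()

decayed
```

With this scheme the event row always gets the largest weight in its group, and two rows that are seconds apart get nearly the same weight, which the rank-based weight1 does not do.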
CodePudding user response:
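We may also use arrange to sort the rows by timestamp within USER and then create the column by dividing row_number() by 10, with a logical condition in case_when so that only rows up to and including the first event row get a weight (any later rows are left as NA).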
library(dplyr)

df %>%
  arrange(USER, timestamp, event %in% 'y') %>%
  group_by(USER) %>%
  mutate(weight = case_when(row_number() <= match('y', event) ~ row_number()/10))
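To see how the case_when condition behaves, here is a small sketch on made-up data. The data frame toy and its step column (a stand-in for timestamp) are hypothetical, and USER 2 deliberately has a row after its event, which the question's data does not.

```r
library(dplyr)

# Hypothetical toy data: 'step' stands in for timestamp
toy <- data.frame(
  USER  = c(1, 1, 1, 2, 2, 2),
  step  = c(1, 2, 3, 1, 2, 3),
  event = c(NA, NA, "y", NA, "y", NA)
)

res <- toy %>%
  arrange(USER, step) %>%
  group_by(USER) %>%
  # rows up to and including the first 'y' get position / 10;
  # rows that match no condition fall through to NA
  mutate(weight = case_when(row_number() <= match("y", event) ~ row_number() / 10)) %>%
  ungroup()

res
```

On this toy data the rows up to the event get 0.1, 0.2, 0.3 for USER 1 and 0.1, 0.2 for USER 2, and the row after the event in USER 2 is NA. Note that these weights are not rescaled, so they do not sum to 1 within a USER.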