I am trying to create a path sequence. The following is a sample dataset:
df <- structure(list(
  sess_id = c(4, 4, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7),
  Page = c("A", "B", "C", "D", "A", "C", "B", "B", "C", "D", "A", "D")),
  .Names = c("sess_id", "Page"),
  row.names = c(NA, -12L),
  class = "data.frame")
This is the table:
sess_id | Page |
---|---|
4 | A |
4 | B |
4 | C |
4 | D |
4 | A |
4 | C |
4 | B |
7 | B |
7 | C |
7 | D |
7 | A |
7 | D |
I would like to add three columns like so:
sess_id | Page | Path | Start | End |
---|---|---|---|---|
4 | A | | | |
4 | B | AB | A | B |
4 | C | ABC | A | C |
4 | D | ABCD | A | D |
4 | A | ABCDA | A | A |
4 | C | BCDAC | B | C |
4 | B | CDACB | C | B |
7 | B | | | |
7 | C | BC | B | C |
7 | D | BCD | B | D |
7 | A | BCDA | B | A |
7 | D | BCDAD | B | D |
I am trying to create a rolling path sequence of up to five pages within each session, and to record the first and last page of each such sequence.
CodePudding user response:
Use `rollapplyr` from package `zoo` to create a rolling sequence per group of `sess_id`. Then the first and last characters of each sequence are the `Start` and `End` columns, respectively.
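df <- structure(list(
  sess_id = c(4, 4, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7),
  Page = c("A", "B", "C", "D", "A", "C", "B", "B", "C", "D", "A", "D")),
  .Names = c("sess_id", "Page"),
  row.names = c(NA, -12L),
  class = "data.frame")

# Rolling path for one session: growing windows first, then full-width windows
fun <- function(x, width) {
  # windows of width 1, 2, ...; keep the first (width - 1) partial paths
  y1 <- zoo::rollapplyr(x, width = seq(width), paste, collapse = "")[1:(width - 1L)]
  # windows of exactly `width` pages
  y2 <- zoo::rollapplyr(x, width = width, paste, collapse = "")
  c(y1, y2)
}

# split pages by session, build the paths, and flatten back
# (this assumes the rows of df are already ordered by sess_id)
sp <- split(df$Page, df$sess_id)
l <- 5L
df$Path <- unlist(lapply(sp, fun, width = l))
df$Start <- substr(df$Path, 1, 1)                # first page of the path
df$End <- substring(df$Path, nchar(df$Path))     # last page of the path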
df
#> sess_id Page Path Start End
#> 1 4 A A A A
#> 2 4 B AB A B
#> 3 4 C ABC A C
#> 4 4 D ABCD A D
#> 5 4 A ABCDA A A
#> 6 4 C BCDAC B C
#> 7 4 B CDACB C B
#> 8 7 B B B B
#> 9 7 C BC B C
#> 10 7 D BCD B D
#> 11 7 A BCDA B A
#> 12 7 D BCDAD B D
Created on 2022-11-08 with reprex v2.0.2
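As a cross-check on the rolling-window idea above, here is a minimal base R sketch that does not depend on `zoo`. The helper name `roll_path` is hypothetical (it is not part of any package), and the sketch assumes a trailing window of at most five pages, as in the question. Because `ave()` writes results back in the original row order, it should not require the rows to be sorted by `sess_id`.

```r
df <- data.frame(
  sess_id = c(4, 4, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7),
  Page = c("A", "B", "C", "D", "A", "C", "B", "B", "C", "D", "A", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each position i, paste the pages in the
# trailing window x[max(1, i - width + 1):i]
roll_path <- function(x, width = 5L) {
  vapply(seq_along(x), function(i) {
    paste(x[max(1L, i - width + 1L):i], collapse = "")
  }, character(1))
}

# ave() applies roll_path within each sess_id and keeps the row order
df$Path <- ave(df$Page, df$sess_id, FUN = roll_path)
df$Start <- substr(df$Path, 1, 1)
df$End <- df$Page  # the last page of a trailing window is the current page
df
```

Like the `zoo` version, this keeps a one-letter path on the first row of each session rather than leaving it blank.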
CodePudding user response:
You can use `accumulate` together with `substr`, like below:
library(dplyr)
library(purrr)
df %>%
  group_by(sess_id) %>%
  mutate(Path = accumulate(Page, paste0)) %>%
  ungroup() %>%
  mutate(
    Path = substr(Path, nchar(Path) - 4, nchar(Path)),
    Start = substr(Path, 1, 1),
    End = Page
  )
which gives
# A tibble: 12 × 5
sess_id Page Path Start End
<dbl> <chr> <chr> <chr> <chr>
1 4 A A A A
2 4 B AB A B
3 4 C ABC A C
4 4 D ABCD A D
5 4 A ABCDA A A
6 4 C BCDAC B C
7 4 B CDACB C B
8 7 B B B B
9 7 C BC B C
10 7 D BCD B D
11 7 A BCDA B A
12 7 D BCDAD B D
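To make the two steps of this answer easier to follow, the sketch below runs them on the pages of session 4 alone, outside of any data frame. The variable names `pages`, `cumulative` and `last_five` are only illustrative. It relies on `substr()` treating a start position below 1 as 1, which is why the short strings at the beginning come through unchanged.

```r
library(purrr)

pages <- c("A", "B", "C", "D", "A", "C", "B")

# Step 1: cumulative concatenation, one string per visited page
cumulative <- accumulate(pages, paste0)
cumulative
#> [1] "A"       "AB"      "ABC"     "ABCD"    "ABCDA"   "ABCDAC"  "ABCDACB"

# Step 2: keep only the last five characters of each string
last_five <- substr(cumulative, nchar(cumulative) - 4, nchar(cumulative))
last_five
#> [1] "A"     "AB"    "ABC"   "ABCD"  "ABCDA" "BCDAC" "CDACB"
```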
CodePudding user response:
The following works and uses tidyverse. `Path` is first created as all the letters within each `sess_id` stuck together. Then keep the first n letters, where n is the row number within the session, and from that string keep up to five characters from the end.
The `Start` and `End` are just the first and last letters of `Path`.
At the end we set `Path`, `Start` and `End` to `""` when the length of `Path` is one.
library(tidyverse)

df <- df %>%
  group_by(sess_id) %>%
  mutate(Path = paste0(Page, collapse = "") %>%
           str_sub(1, row_number()) %>%      # first n letters, n = row within session
           str_extract("\\w{0,5}$"),         # up to five letters from the end
         Start = str_extract(Path, "^\\w"),
         End = str_extract(Path, "\\w$")) %>%
  # blank out rows whose path is a single page
  mutate(across(c(Path, Start, End), ~if_else(str_length(Path) == 1, "", .)))
> df
# A tibble: 12 x 5
# Groups: sess_id [2]
sess_id Page Path Start End
<dbl> <chr> <chr> <chr> <chr>
1 4 A "" "" ""
2 4 B "AB" "A" "B"
3 4 C "ABC" "A" "C"
4 4 D "ABCD" "A" "D"
5 4 A "ABCDA" "A" "A"
6 4 C "BCDAC" "B" "C"
7 4 B "CDACB" "C" "B"
8 7 B "" "" ""
9 7 C "BC" "B" "C"
10 7 D "BCD" "B" "D"
11 7 A "BCDA" "B" "A"
12 7 D "BCDAD" "B" "D"
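The final blanking step can also be sketched in base R, which may be handy for applying it to the output of the other two answers, where the first row of each session keeps a one-letter path. The data frame `res` below is a small made-up excerpt of such output (first two rows of each session), used only for illustration.

```r
# Assumed input: an excerpt shaped like the output of the earlier answers
res <- data.frame(
  sess_id = c(4, 4, 7, 7),
  Page  = c("A", "B", "B", "C"),
  Path  = c("A", "AB", "B", "BC"),
  Start = c("A", "A", "B", "B"),
  End   = c("A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Rows whose path consists of a single page
single <- nchar(res$Path) == 1

# Blank out Path, Start and End on those rows
res[single, c("Path", "Start", "End")] <- ""
res
```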