Suppose I have this dataframe, df, in R:
UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,3,1,2,1,2,1,2,3)
Page <- c("home", "about", "services", "home", "pricing", "pricing", "home", "about", "home", "services")
df <- data.frame(UserID, PathID, Page)
I would like to add a column called "Set" that indexes each user's combination of pages, so that users who visited the same pages (in any order) get the same index.
So, my output should look like this:
UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,3,1,2,1,2,1,2,3)
Page <- c("home", "about", "services", "home", "pricing", "pricing", "home", "about", "home", "services")
Set <- c(1,1,1,2,2,2,2,1,1,1)
df1 <- data.frame(UserID, PathID, Page, Set)
I would really appreciate some help here.
CodePudding user response:
A data.table option using as.factor:
> setDT(df)[, Set := toString(sort(Page)), UserID][, Set := as.integer(as.factor(Set))][]
UserID PathID Page Set
1: 1 1 home 1
2: 1 2 about 1
3: 1 3 services 1
4: 5 1 home 2
5: 5 2 pricing 2
6: 7 1 pricing 2
7: 7 2 home 2
8: 9 1 about 1
9: 9 2 home 1
10: 9 3 services 1
A similar base R implementation is
> transform(df, Set = as.integer(as.factor(ave(Page,UserID,FUN = function(x) toString(sort(x))))))
UserID PathID Page Set
1 1 1 home 1
2 1 2 about 1
3 1 3 services 1
4 5 1 home 2
5 5 2 pricing 2
6 7 1 pricing 2
7 7 2 home 2
8 9 1 about 1
9 9 2 home 1
10 9 3 services 1
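If the one-liners are hard to read, the same idea can be split into two steps. This is only a sketch in base R, not a replacement for the code above: the variable name key is my own, and I number the keys with match(key, unique(key)) instead of as.factor, so the indices follow order of first appearance rather than alphabetical order of the keys. For the example data both orderings happen to give the same result. It also assumes a page repeated within one user should count twice; wrap x in unique() if it should not.

```r
# Rebuild the example data from the question
df <- data.frame(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services"),
  stringsAsFactors = FALSE
)

# Step 1: for every row, build a key from the sorted pages of that row's user,
# so "home, pricing" and "pricing, home" collapse to the same string
key <- ave(df$Page, df$UserID, FUN = function(x) toString(sort(x)))

# Step 2: number the distinct keys in order of first appearance
df$Set <- match(key, unique(key))

df
```

Sorting before collapsing is what makes users 5 and 7 end up in the same Set even though they visited the pages in a different order.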
CodePudding user response:
A possible solution:
library(tidyverse)
df %>%
group_by(UserID) %>%
summarise(Set = str_c(sort(Page), collapse = ",")) %>%
group_by(Set) %>%
mutate(Set = cur_group_id()) %>%
ungroup() %>%
right_join(df) %>%
relocate(Set, .after = Page)
#> Joining, by = "UserID"
#> # A tibble: 10 × 4
#> UserID PathID Page Set
#> <dbl> <dbl> <chr> <int>
#> 1 1 1 home 1
#> 2 1 2 about 1
#> 3 1 3 services 1
#> 4 5 1 home 2
#> 5 5 2 pricing 2
#> 6 7 1 pricing 2
#> 7 7 2 home 2
#> 8 9 1 about 1
#> 9 9 2 home 1
#> 10 9 3 services 1