Indexing Combination of Sequences in R

Suppose I have this dataframe, df, in R:

 UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
 PathID <- c(1,2,3,1,2,1,2,1,2,3)
 Page <- c("home", "about", "services", "home", "pricing", "pricing", "home", "about", "home", "services")
 df <- data.frame(UserID, PathID, Page)

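I would like to add a column called "Set" that indexes each distinct combination of pages a user visited, regardless of the order in which they were visited. Users with the same combination of pages should get the same index.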

So, my output should look like this:

 UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
 PathID <- c(1,2,3,1,2,1,2,1,2,3)
 Page <- c("home", "about", "services", "home", "pricing", "pricing", "home", "about", "home", "services")
 Set <- c(1,1,1,2,2,2,2,1,1,1)
 df1 <- data.frame(UserID, PathID, Page, Set)

I would really appreciate some help here.

CodePudding user response:

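A data.table option using as.factor: build a sorted, comma-separated key of pages for each UserID, then convert that key to integer factor codes

> library(data.table)
> setDT(df)[, Set := toString(sort(Page)), UserID][, Set := as.integer(as.factor(Set))][]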
    UserID PathID     Page Set
 1:      1      1     home   1
 2:      1      2    about   1
 3:      1      3 services   1
 4:      5      1     home   2
 5:      5      2  pricing   2
 6:      7      1  pricing   2
 7:      7      2     home   2
 8:      9      1    about   1
 9:      9      2     home   1
10:      9      3 services   1

A similar base R implementation is

> transform(df, Set = as.integer(as.factor(ave(Page,UserID,FUN = function(x) toString(sort(x))))))
   UserID PathID     Page Set
1       1      1     home   1
2       1      2    about   1
3       1      3 services   1
4       5      1     home   2
5       5      2  pricing   2
6       7      1  pricing   2
7       7      2     home   2
8       9      1    about   1
9       9      2     home   1
10      9      3 services   1
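
To make the idea behind both snippets easier to follow, here is a minimal base R sketch that splits it into two steps. It is only an illustration: the intermediate variable `key` is a name introduced here, and `match(key, unique(key))` is swapped in for `as.integer(as.factor(...))` so that sets are numbered in order of first appearance rather than in the sorted order of the key strings. For the example data both approaches give the same result.

```r
# Rebuild the example data from the question.
df <- data.frame(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services")
)

# Step 1: build one order-independent key per user by sorting that user's
# pages and joining them into a single string ("key" is a hypothetical name).
key <- ave(df$Page, df$UserID, FUN = function(x) toString(sort(x)))

# Step 2: number the distinct keys in order of first appearance.
df$Set <- match(key, unique(key))

print(df)
```

If a user can visit the same page more than once and repeats should not matter, `sort(unique(x))` could be used in place of `sort(x)` when building the key.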

CodePudding user response:

A possible solution:

library(tidyverse)

df %>%
    group_by(UserID) %>%
    summarise(Set = str_c(sort(Page), collapse = ",")) %>%
    group_by(Set) %>%
    mutate(Set = cur_group_id()) %>%
    ungroup %>%
    right_join(df) %>%
    relocate(Set, .after = Page)

#> Joining, by = "UserID"
#> # A tibble: 10 × 4
#>    UserID PathID Page       Set
#>     <dbl>  <dbl> <chr>    <int>
#>  1      1      1 home         1
#>  2      1      2 about        1
#>  3      1      3 services     1
#>  4      5      1 home         2
#>  5      5      2 pricing      2
#>  6      7      1 pricing      2
#>  7      7      2 home         2
#>  8      9      1 about        1
#>  9      9      2 home         1
#> 10      9      3 services     1
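
As a possible variation on the pipeline above, the same result can be sketched without the summarise and join steps by computing the key with a grouped mutate. This assumes the dplyr package is installed; the temporary column `key` and the result name `out` are names introduced here for illustration, and sets are numbered in order of first appearance.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services")
)

out <- df %>%
  group_by(UserID) %>%
  # One sorted, comma-separated key per user, repeated on each of the user's rows.
  mutate(key = paste(sort(Page), collapse = ",")) %>%
  ungroup() %>%
  # Number the distinct keys in order of first appearance, then drop the helper column.
  mutate(Set = match(key, unique(key))) %>%
  select(-key)

print(out)
```

Because the key is attached directly to every row, the original row order of df is kept and no join message is printed.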