Summarize collocate frequencies more concisely

Time:12-03

In this toy dataframe:

df <- structure(list(left = c("a", "a", "a", "a", "a", "a", "b", "b"), 
                     node = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx","yyy"), 
                     right = c("c", "d", "d", "d", "c", "c", "d", "d")), 
                row.names = c(NA, -8L), class = "data.frame")

I'm trying to summarize frequencies of collocates both on the left and on the right of the node forms in a concise way. The way I'm doing it does produce the expected frequencies on both sides:

library(dplyr)

df_l <- df %>%
  group_by(node, left) %>%
  summarise(n_left = n()) %>%
  arrange(desc(n_left)) %>%
  select(c(3,2,1))

df_r <- df %>%
  group_by(node, right) %>%
  summarise(n_right = n()) %>%
  arrange(desc(n_right))

df_all <- left_join(df_l, df_r, by = "node")
# A tibble: 8 × 5
# Groups:   node [2]
  n_left left  node  right n_right
   <int> <chr> <chr> <chr>   <int>
1      3 a     xxx   d           3
2      3 a     xxx   c           1
3      3 a     yyy   c           2
4      3 a     yyy   d           2
5      1 b     xxx   d           3
6      1 b     xxx   c           1
7      1 b     yyy   c           2
8      1 b     yyy   d           2

The problem is that the way the frequencies are displayed is anything but concise: there are numerous duplicates in the left/n_left and right/n_right columns. The output I'm looking for has no redundancies, like this:

#     n_left left  node  right n_right
#     <int> <chr> <chr> <chr>   <int>
#  1      3 a     xxx   d           3
#  2      1 b     xxx   c           1
#  3      3 a     yyy   c           2
#  4      1 b     yyy   d           2

Any idea how that can be achieved?

CodePudding user response:

Reshaping the data to long format lets you do the aggregation for both sides at once. The select call at the end could certainly be improved so it isn't manual (changing names and rearranging based on containing "left", for example), but this should be a start.

library(dplyr)

df %>%
  tidyr::pivot_longer(c(left, right), names_to = "side", values_to = "letter") %>%
  count(node, side, letter) %>%
  tidyr::pivot_wider(id_cols = node, names_from = side, values_from = c(letter, n),
                     values_fn = list) %>%
  tidyr::unnest(-node) %>%
  select(n_left, left = letter_left, node, right = letter_right, n_right)
#> # A tibble: 4 × 5
#>   n_left left  node  right n_right
#>    <int> <chr> <chr> <chr>   <int>
#> 1      3 a     xxx   c           1
#> 2      1 b     xxx   d           3
#> 3      3 a     yyy   c           2
#> 4      1 b     yyy   d           2
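
Note that the result above pairs the collocates in alphabetical order within each node, so the xxx rows do not match the requested table (there, d with 3 sits next to a with 3). A minimal sketch of one way to get that layout, assuming the goal is to pair left and right collocates by their frequency rank within each node: count each side separately, rank within node, then join on node and rank. The helper name rank_side and the rank column are hypothetical, introduced only for illustration, and ties are broken alphabetically, which is also an assumption.

```r
library(dplyr)

# Toy data from the question
df <- data.frame(
  left  = c("a", "a", "a", "a", "a", "a", "b", "b"),
  node  = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx", "yyy"),
  right = c("c", "d", "d", "d", "c", "c", "d", "d")
)

# Hypothetical helper: count the collocates on one side and rank them
# within each node by descending frequency (ties broken alphabetically)
rank_side <- function(data, side) {
  data %>%
    count(node, collocate = .data[[side]], name = "n") %>%
    arrange(node, desc(n), collocate) %>%
    group_by(node) %>%
    mutate(rank = row_number()) %>%
    ungroup()
}

left_counts  <- rank_side(df, "left")  %>% rename(left = collocate, n_left = n)
right_counts <- rank_side(df, "right") %>% rename(right = collocate, n_right = n)

# Pair the k-th most frequent left collocate with the k-th most frequent
# right collocate of the same node
result <- full_join(left_counts, right_counts, by = c("node", "rank")) %>%
  arrange(node, rank) %>%
  select(n_left, left, node, right, n_right)

result
```

A full join is used so that a node with more distinct collocates on one side than the other keeps all of them, with NA filling the shorter side.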

CodePudding user response:

This comes close. Perhaps it is good enough for you; otherwise it might inspire someone to build on it.

library(data.table)
setDT(df)
df.melt <- melt(df, id.vars = "node")
dcast(df.melt, node + value ~ paste0("n_", variable), value.var = "value", fun.aggregate = length)
#    node value n_left n_right
# 1:  xxx     a      3       0
# 2:  xxx     b      1       0
# 3:  xxx     c      0       1
# 4:  xxx     d      0       3
# 5:  yyy     a      3       0
# 6:  yyy     b      1       0
# 7:  yyy     c      0       2
# 8:  yyy     d      0       2
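
To go from this long layout to the side-by-side table in the question while staying in data.table, one possibility is to count each side separately, number the rows within each node by descending frequency, and merge on that number. This is only a sketch under the assumption that left and right collocates should be paired by frequency rank; the rank column is a hypothetical helper and ties are broken alphabetically.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  left  = c("a", "a", "a", "a", "a", "a", "b", "b"),
  node  = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx", "yyy"),
  right = c("c", "d", "d", "d", "c", "c", "d", "d")
)

# Frequencies per node and collocate, most frequent first within each node
left_counts  <- dt[, .(n_left = .N), by = .(node, left)][order(node, -n_left, left)]
right_counts <- dt[, .(n_right = .N), by = .(node, right)][order(node, -n_right, right)]

# rowid() numbers the rows within each node, giving a frequency rank
left_counts[, rank := rowid(node)]
right_counts[, rank := rowid(node)]

# Pair collocates of equal rank; all = TRUE keeps unmatched ranks as NA
result <- merge(left_counts, right_counts, by = c("node", "rank"), all = TRUE)
result <- result[, .(n_left, left, node, right, n_right)]

result
```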
  