In this toy dataframe:
df <- structure(list(left = c("a", "a", "a", "a", "a", "a", "b", "b"),
node = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx","yyy"),
right = c("c", "d", "d", "d", "c", "c", "d", "d")),
row.names = c(NA, -8L), class = "data.frame")
I'm trying to summarize frequencies of collocates both on the left and on the right of the node forms in a concise way. The way I'm doing it does produce the expected frequencies on both sides:
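library(dplyr)

# Frequencies of left-hand collocates per node
df_l <- df %>%
  group_by(node, left) %>%
  summarise(n_left = n()) %>%
  arrange(desc(n_left)) %>%
  select(c(3, 2, 1))  # reorder to n_left, left, node

# Frequencies of right-hand collocates per node
df_r <- df %>%
  group_by(node, right) %>%
  summarise(n_right = n()) %>%
  arrange(desc(n_right))

# Joining on node alone pairs every left row with every right row of that node
df_all <- left_join(df_l, df_r, by = "node")
df_all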
# A tibble: 8 × 5
# Groups:   node [2]
  n_left left  node  right n_right
   <int> <chr> <chr> <chr>   <int>
1      3 a     xxx   d           3
2      3 a     xxx   c           1
3      3 a     yyy   c           2
4      3 a     yyy   d           2
5      1 b     xxx   d           3
6      1 b     xxx   c           1
7      1 b     yyy   c           2
8      1 b     yyy   d           2
The problem is that the way the frequencies are displayed is anything but concise: there are numerous duplicates in the left/n_left and, respectively, right/n_right columns. The output I'm looking for is without any redundancies, like this:
#   n_left left  node  right n_right
#    <int> <chr> <chr> <chr>   <int>
# 1      3 a     xxx   d           3
# 2      1 b     xxx   c           1
# 3      3 a     yyy   c           2
# 4      1 b     yyy   d           2
Any idea how that can be achieved?
CodePudding user response:
Reshaping the data lets you rethink how you can do the aggregation all at once. The select call at the end could certainly be improved upon so it isn't manual (changing names and rearranging based on containing "left", for example), but this should be a start.
library(dplyr)

df %>%
  tidyr::pivot_longer(c(left, right), names_to = "side", values_to = "letter") %>%
  count(node, side, letter) %>%
  tidyr::pivot_wider(id_cols = node, names_from = side, values_from = c(letter, n),
                     values_fn = list) %>%
  tidyr::unnest(-node) %>%
  select(n_left, left = letter_left, node, right = letter_right, n_right)
#> # A tibble: 4 × 5
#>   n_left left  node  right n_right
#>    <int> <chr> <chr> <chr>   <int>
#> 1      3 a     xxx   c           1
#> 2      1 b     xxx   d           3
#> 3      3 a     yyy   c           2
#> 4      1 b     yyy   d           2
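Note that the result above pairs the collocates in alphabetical order (a with c for xxx), whereas the desired output in the question pairs them by frequency (a with d). If the intent is to line up the most frequent left collocate with the most frequent right collocate, one possible sketch is to rank each side within its node and join on that rank. The names left_counts, right_counts, rank and res are made up for illustration, and breaking ties alphabetically is an assumption, not something the question specifies.

```r
library(dplyr)

df <- data.frame(
  left  = c("a", "a", "a", "a", "a", "a", "b", "b"),
  node  = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx", "yyy"),
  right = c("c", "d", "d", "d", "c", "c", "d", "d")
)

# Count left collocates per node and rank them by descending frequency
# (ties broken alphabetically, which is an assumption)
left_counts <- df %>%
  count(node, left, name = "n_left") %>%
  arrange(node, desc(n_left), left) %>%
  group_by(node) %>%
  mutate(rank = row_number()) %>%
  ungroup()

# Same for the right collocates
right_counts <- df %>%
  count(node, right, name = "n_right") %>%
  arrange(node, desc(n_right), right) %>%
  group_by(node) %>%
  mutate(rank = row_number()) %>%
  ungroup()

# Pair the k-th most frequent left collocate with the k-th most frequent right one
res <- full_join(left_counts, right_counts, by = c("node", "rank")) %>%
  select(n_left, left, node, right, n_right)

res
```

If a node has more distinct collocates on one side than on the other, the full join leaves NA on the shorter side rather than repeating rows.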
CodePudding user response:
This comes close... Perhaps it is good enough for you; otherwise it might inspire someone to build on it.
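library(data.table)
setDT(df)
# Long format: one row per (node, side, collocate)
df.melt <- melt(df, id.vars = "node")
# Count occurrences of each collocate per node, one count column per side
dcast(df.melt, node + value ~ paste0("n_", variable), value.var = "value", fun.aggregate = length)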
#    node value n_left n_right
# 1:  xxx     a      3       0
# 2:  xxx     b      1       0
# 3:  xxx     c      0       1
# 4:  xxx     d      0       3
# 5:  yyy     a      3       0
# 6:  yyy     b      1       0
# 7:  yyy     c      0       2
# 8:  yyy     d      0       2
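To get from this long layout to the side-by-side one in the question, one way to build on the melt/dcast idea is to count in long form, number the collocates within each node and side by descending frequency, and then cast both the collocate and its count to wide. This is only a sketch: the column names side, collocate and rank are invented here, and ties are broken alphabetically by assumption.

```r
library(data.table)

dt <- data.table(
  left  = c("a", "a", "a", "a", "a", "a", "b", "b"),
  node  = c("xxx", "yyy", "xxx", "xxx", "yyy", "yyy", "xxx", "yyy"),
  right = c("c", "d", "d", "d", "c", "c", "d", "d")
)

# Long format: one row per (node, side, collocate) occurrence
long <- melt(dt, id.vars = "node", variable.name = "side", value.name = "collocate")

# Frequency of each collocate per node and side
counts <- long[, .N, by = .(node, side, collocate)]

# Rank collocates within node and side, most frequent first
setorder(counts, node, side, -N, collocate)
counts[, rank := seq_len(.N), by = .(node, side)]

# Cast both the collocate and its count to one column per side
wide <- dcast(counts, node + rank ~ side, value.var = c("collocate", "N"))

# Rename and reorder to match the layout asked for in the question
setnames(wide,
         c("collocate_left", "collocate_right", "N_left", "N_right"),
         c("left", "right", "n_left", "n_right"))
wide[, rank := NULL]
setcolorder(wide, c("n_left", "left", "node", "right", "n_right"))

wide
```

Because dcast fills missing combinations with NA, a node with unequal numbers of left and right collocates ends up with NA on the shorter side.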