Similar questions have been asked here and here. However, they don't solve my specific problem.
I am trying to count pairs of observations in a data frame, where the data frame is grouped by two other variables.
For example, if I have a data frame like the one below:
library(dplyr)
set.seed(100)
dft <- data.frame(
  var = sample(LETTERS[1:5], 10, replace = TRUE),
  num = c(1,1,2,1,1,1,2,1,1,1),
  iter = c(1,1,1,2,2,2,2,3,3,3)
)
dft <- dft %>%
  group_by(iter, num)
> dft
   var num iter
1    B   1    1
2    C   1    1
3    A   2    1
4    B   1    2
5    D   1    2
6    D   1    2
7    B   2    2
8    C   1    3
9    B   1    3
10   E   1    3
In my example, a pair consists of an observation and the one preceding it. E.g., if we have something like B, C, B, A in one grouping, the pairs would be B:C, C:B and B:A.
We can see that the pair B:C appears once when iter == 1 and num == 1. The pairs B:D and D:D appear once each when iter == 2 and num == 1, and the pairs C:B and B:E appear once each when iter == 3 and num == 1.
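The pairing rule described above can be sketched in a few lines of base R. This is only an illustration for a single group, using the B, C, B, A example; the names `x` and `pairs` are made up here and are not part of the original code.

```r
# One group of observations, in order
x <- c("B", "C", "B", "A")

# Pair each observation with the one that follows it:
# drop the last element on the left, the first element on the right
pairs <- paste(head(x, -1), tail(x, -1), sep = ":")
pairs
#> [1] "B:C" "C:B" "B:A"

# Counting is then a plain frequency table; "B:C" and "C:B" are
# different strings, so reversed pairs stay separate
table(pairs)
```

The remaining problem is applying this within each iter/num group rather than across the whole column.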
I was thinking of doing something like this:
g1 <- expand.grid(dft$var, sort(dft$var), iter = dft$iter)
g1$count <- NA
The idea would be to fill the g1$count column with the number of times each pair appears. However, I can't figure out how to actually count the pairs within each group.
Additionally, reversed pairings are not equivalent. For example, the pair B:E is not the same as the pair E:B.
Any suggestions as to how I can count these pairs?
CodePudding user response:
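library(dplyr)
library(stringr)  # needed for str_detect()

dft %>%
  group_by(iter, num) %>%
  # pair each value with the next one; the last value in a group gets an empty partner
  summarise(nn = paste(var, lead(var, default = ''), sep = ':'),
            .groups = 'keep') %>%
  count(nn) %>%
  # drop the incomplete trailing pairs such as "C:"
  filter(str_detect(nn, '.:.'))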
# A tibble: 5 x 4
# Groups: iter, num [3]
   iter   num nn        n
  <dbl> <dbl> <chr> <int>
1     1     1 B:C       1
2     2     1 B:D       1
3     2     1 D:D       1
4     3     1 B:E       1
5     3     1 C:B       1
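A possible variant of the pipeline above, offered as a sketch rather than a replacement: in dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). The version below assumes dplyr >= 1.1.0, hard-codes the var column to match the data frame printed in the question (so it does not depend on the random seed), and uses the illustrative names `pair` and `pair_counts`. It also builds only complete pairs, so no string filter is needed afterwards.

```r
library(dplyr)

# Data hard-coded to match the printed data frame in the question
dft <- data.frame(
  var  = c("B", "C", "A", "B", "D", "D", "B", "C", "B", "E"),
  num  = c(1, 1, 2, 1, 1, 1, 2, 1, 1, 1),
  iter = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
)

pair_counts <- dft %>%
  group_by(iter, num) %>%
  # head(var, -1) drops the last value, tail(var, -1) drops the first,
  # so groups with a single row contribute no pairs at all
  reframe(pair = paste(head(var, -1), tail(var, -1), sep = ":")) %>%
  # reframe() returns an ungrouped data frame, so name the keys explicitly
  count(iter, num, pair, name = "count")

pair_counts
```

This should give the same five iter/num/pair combinations as the output above, each with a count of 1.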
CodePudding user response:
Here is a base R solution.
set.seed(100)
dft <- data.frame(
  var = sample(LETTERS[1:5], 10, replace = TRUE),
  num = c(1,1,2,1,1,1,2,1,1,1),
  iter = c(1,1,1,2,2,2,2,3,3,3)
)
sp <- split(dft$var, list(dft$num, dft$iter))
res <- lapply(sp, \(x){
  table(paste(x[-length(x)], x[-1], sep = ":"))
})
res <- res[sapply(res, nrow) > 0L]
res <- lapply(seq_along(res), \(i){
  nms <- strsplit(names(res)[i], "\\.")[[1]]
  dat <- as.data.frame(res[[i]], responseName = "count")
  cbind(dat, num = nms[1], iter = nms[2])
})
res <- do.call(rbind, res)
res
#>   Var1 count num iter
#> 1  B:C     1   1    1
#> 2  B:D     1   1    2
#> 3  D:D     1   1    2
#> 4  B:E     1   1    3
#> 5  C:B     1   1    3
Created on 2022-02-25 by the reprex package (v2.0.1)
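One detail of the approach above is that num and iter are recovered from the names of the split list, so they come back as character strings. If the original column types matter, a hedged alternative is to split the whole data frame and read the keys from each piece. The sketch below hard-codes var to match the printed data in the question, and the names `pieces`, `pairs_df` and `res2` are illustrative only.

```r
# Data hard-coded to match the printed data frame in the question
dft <- data.frame(
  var  = c("B", "C", "A", "B", "D", "D", "B", "C", "B", "E"),
  num  = c(1, 1, 2, 1, 1, 1, 2, 1, 1, 1),
  iter = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
)

# Split the full data frame so each piece carries its own iter/num values;
# drop = TRUE skips combinations that never occur
pieces <- split(dft, list(dft$iter, dft$num), drop = TRUE)

pairs_df <- do.call(rbind, lapply(pieces, function(g) {
  n <- nrow(g)
  if (n < 2) return(NULL)  # a single row cannot form a pair
  data.frame(iter = g$iter[1], num = g$num[1],
             pair = paste(g$var[-n], g$var[-1], sep = ":"))
}))

# Count identical pairs within each iter/num combination
pairs_df$count <- 1L
res2 <- aggregate(count ~ iter + num + pair, data = pairs_df, FUN = sum)
res2
```

Here iter and num stay numeric, at the cost of an explicit guard for single-row groups.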