Count pairs by groups in R?

Time:02-26

Similar questions have been asked here and here. However, they don't solve my specific problem.

I am trying to count pairs of observations in a data frame, where the data frame is grouped by two other variables.

For example, if I have a data frame like the one below:

library(dplyr)

set.seed(100)
dft <- data.frame(
  var = sample(LETTERS[1:5], 10, replace = TRUE),
  num = c(1,1,2,1,1,1,2,1,1,1),
  iter = c(1,1,1,2,2,2,2,3,3,3)
)

dft <- dft %>% 
  group_by(iter, num)

> dft
   var num iter
1    B   1    1
2    C   1    1
3    A   2    1
4    B   1    2
5    D   1    2
6    D   1    2
7    B   2    2
8    C   1    3
9    B   1    3
10   E   1    3

In my example, a pair consists of an observation and the one immediately preceding it within the same group. E.g., if one group contains B,C,B,A, the pairs would be B:C, C:B and B:A.
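To make the pairing rule concrete, here is a minimal base R sketch for a single group. The helper name `adjacent_pairs` is hypothetical (it is not part of the question); it simply pastes each element to the one that follows it.

```r
# Hypothetical helper: ordered pairs of adjacent elements in one group.
adjacent_pairs <- function(x) {
  # A group with fewer than two observations has no pairs
  if (length(x) < 2L) return(character(0))
  # Drop the last element on the left and the first on the right, then join
  paste(x[-length(x)], x[-1], sep = ":")
}

adjacent_pairs(c("B", "C", "B", "A"))
#> [1] "B:C" "C:B" "B:A"
```

A vector of length n yields n - 1 pairs, so a group with a single row contributes nothing.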

We can see that the pair B:C appears once when iter == 1 and num == 1.

The pairs B:D and D:D appear once each when iter == 2 and num == 1.

The pairs C:B and B:E appear once each when iter == 3 and num == 1.

I was thinking of doing something like this:

g1 <- expand.grid(dft$var, sort(dft$var), iter = dft$iter)
g1$count <- NA

and then filling the g1$count column with the number of times each pair appears. However, I can't figure out how to actually count the pairs within each group.
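One possible way to get such a grid, sketched in base R under a few assumptions: the `var` values are hard-coded from the printed data frame above (so the result does not depend on the random number generator), and the names `lv`, `groups`, `first` and `second` are illustrative only. Converting both sides of each pair to factors with the full set of levels makes `table()` report unseen pairs as 0.

```r
# Values copied from the printed data frame, so no RNG is involved
dft <- data.frame(
  var  = c("B", "C", "A", "B", "D", "D", "B", "C", "B", "E"),
  num  = c(1, 1, 2, 1, 1, 1, 2, 1, 1, 1),
  iter = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
)

lv     <- sort(unique(dft$var))          # all letters that occur
groups <- unique(dft[c("iter", "num")])  # one row per (iter, num) group

g1 <- do.call(rbind, lapply(seq_len(nrow(groups)), function(i) {
  x <- dft$var[dft$iter == groups$iter[i] & dft$num == groups$num[i]]
  # Full factor levels keep pairs that never occur in the table, with count 0
  first  <- factor(x[-length(x)], levels = lv)
  second <- factor(x[-1], levels = lv)
  tab <- as.data.frame(table(first = first, second = second),
                       responseName = "count")
  cbind(iter = groups$iter[i], num = groups$num[i], tab)
}))

# Only the pairs that actually occur
g1[g1$count > 0, ]
```

Because `first` and `second` are separate columns, direction is preserved: for iter == 3 and num == 1 the row with first = B, second = E has count 1, while first = E, second = B has count 0.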

Additionally, reversed pairings are not equivalent: the pair B:E is distinct from the pair E:B.

Any suggestions as to how I can count these pairs?

CodePudding user response:

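library(dplyr)
library(stringr)  # provides str_detect()

dft %>%
  group_by(iter, num) %>%
  # Pair each value with the next one in its group; the last value pairs with ''
  summarise(nn = paste(var, lead(var, default = ''), sep = ':'),
            .groups = 'keep') %>%
  count(nn) %>%
  # Drop the incomplete trailing pairs such as 'C:'
  filter(str_detect(nn, '.:.'))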

# A tibble: 5 x 4
# Groups:   iter, num [3]
   iter   num nn        n
  <dbl> <dbl> <chr> <int>
1     1     1 B:C       1
2     2     1 B:D       1
3     2     1 D:D       1
4     3     1 B:E       1
5     3     1 C:B       1

CodePudding user response:

Here is a base R solution.

set.seed(100)
dft <- data.frame(
  var = sample(LETTERS[1:5], 10, replace = TRUE),
  num = c(1,1,2,1,1,1,2,1,1,1),
  iter = c(1,1,1,2,2,2,2,3,3,3)
)

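# Split var into one vector per (num, iter) combination
sp <- split(dft$var, list(dft$num, dft$iter))
# Tabulate the adjacent pairs within each group;
# \(x) is the R >= 4.1 shorthand for function(x)
res <- lapply(sp, \(x){
  table(paste(x[-length(x)], x[-1], sep = ":"))
})
# Drop groups that contain no pairs
res <- res[sapply(res, nrow) > 0L]
# Recover num and iter from the list names, e.g. "1.2" means num = 1, iter = 2
res <- lapply(seq_along(res), \(i){
  nms <- strsplit(names(res)[i], "\\.")[[1]]
  dat <- as.data.frame(res[[i]], responseName = "count")
  cbind(dat, num = nms[1], iter = nms[2])
})
res <- do.call(rbind, res)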
res
#>   Var1 count num iter
#> 1  B:C     1   1    1
#> 2  B:D     1   1    2
#> 3  D:D     1   1    2
#> 4  B:E     1   1    3
#> 5  C:B     1   1    3

Created on 2022-02-25 by the reprex package (v2.0.1)
