Home > Back-end >  Count the amount of times value A occurs without value B and vice versa
Count the amount of times value A occurs without value B and vice versa

Time:04-16

I'm having trouble figuring out how to do the opposite of the answer to this question (and in R not python).

Count the amount of times value A occurs with value B

Basically I have a dataframe with a lot of combinations of pairs of columns like so:

df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
                 id2 = c("2","2","3","4","1","3","4","1","4","2","1"))

I want to count, how often all the values in column A occur in the whole dataframe without the values from column B. So the results for this small example would be the output of:

df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
                        id2 = c("2","3","4","1","3","4","1","4","2","1"),
                        count = c("4","5","5","3","5","4","2","3","3","3"))

The important criteria for this, is that the final results dataframe is collapsed by the pairs (so in my example rows 1 and 2 are duplicates, and they are collapsed and summed by the total frequency 1 is observed without 2). For tallying the count of occurances, it's important that both columns are examined. I.e. order of columns doesn't matter for calculating the frequency - if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.

I can do this very slowly by filtering for each pair, but it's not really feasible for my real data where I have many many different pairs.

Any guidance is greatly appreciated.

CodePudding user response:

First paste the two id columns together to id12 for later matching. Then use sapply to go through all rows to see the records where id1 appears in id12 but id2 doesn't. sum that value and only output the distinct records. Finally, remove the id12 column.

library(dplyr)

df %>% mutate(id12 = paste0(id1, id2),
              count = sapply(1:nrow(.), 
                             function(x) 
                               sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>% 
  distinct() %>% 
  select(-id12)

Or in base R completely:

id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]

Output

   id1 id2 count
1    1   2     4
2    1   3     5
3    1   4     5
4    2   1     3
5    2   3     5
6    2   4     4
7    3   1     2
8    3   4     3
9    4   2     3
10   4   1     3

CodePudding user response:

A full tidyverse version:

library(tidyverse)
df %>% 
  mutate(id = paste(id1, id2),
         count = map(cur_group_rows(), ~ sum(str_detect(id, id1[.x]) & str_detect(id, id2[.x], negate = T))))
  • Related