How to count the different values by comparing two columns in R?-CodePudding

I want to count the values by comparing two columns of the dataframe in R.

For example:

col1  col2
A      A
A      A
A      B
G      G
G      H
Y      Y
Y      Y
J      P
J      P
J      J
K      L

I wish to get an output which shows the count of match (if two columns have same values) and the count of not match (if two columns have different value) and display the percentage of match and not match in the next columns

col1   count_match  count_notmatch   percent_match   percent_notmatch
A       2           1                66.66%          33.33%
G       1           1                50.00%          50.00%
Y       2           0                100.00%         0
J       1           2                33.33%          66.66%
K       0           1                0               100%

How do I achieve this? Thanks for any help.

CodePudding user response：

You could group the data by col1 and summarise():

library(dplyr)

df %>%
  group_by(col1) %>%
  summarise(count_match = sum(col1 == col2),
            count_nomatch = n() - count_match,
            across(contains("match"), ~ .x / n() * 100, .names = "{sub('count', 'percent', .col)}"))

# # A tibble: 5 × 5
#   col1  count_match count_nomatch percent_match percent_nomatch
#   <chr>       <int>         <int>         <dbl>           <dbl>
# 1 A               2             1          66.7            33.3
# 2 G               1             1          50              50
# 3 J               1             2          33.3            66.7
# 4 K               0             1           0             100  
# 5 Y               2             0         100               0

CodePudding user response：

You can build the summary table in a few steps:

library(tidyverse)
library(scales)

d <- structure(list(col1 = c("A", "A", "A", "G", "G", "Y", "Y", "J", "J", "J", "K"), 
                    col2 = c("A", "A", "B", "G", "H", "Y", "Y", "P", "P", "J", "L")), class = "data.frame", row.names = c(NA, -11L))


d %>% 
  mutate(match = col1 == col2, 
         nomatch = !match) %>% 
  group_by(col1) %>% 
  summarise(count_match = sum(match),
            count_nomatch = sum(nomatch)) %>% 
  rowwise() %>% 
  mutate(percent_match = count_match/sum(count_match, count_nomatch), 
         percent_nomatch = 1 - percent_match) %>% 
  mutate(across(starts_with("percent"), ~percent(.x))) %>%
  ungroup()
#> # A tibble: 5 × 5
#>   col1  count_match count_nomatch percent_match percent_nomatch
#>   <chr>       <int>         <int> <chr>         <chr>          
#> 1 A               2             1 67%           33%            
#> 2 G               1             1 50%           50%            
#> 3 J               1             2 33%           67%            
#> 4 K               0             1 0%            100%           
#> 5 Y               2             0 100%          0%

^{Created on 2022-07-18 by the reprex package (v2.0.1)}