I want to count the values by comparing two columns of the dataframe in R.
For example:
col1 col2
A A
A A
A B
G G
G H
Y Y
Y Y
J P
J P
J J
K L
I wish to get an output which shows the count of match (if two columns have same values) and the count of not match (if two columns have different value) and display the percentage of match and not match in the next columns
col1 count_match count_notmatch percent_match percent_notmatch
A 2 1 66.66% 33.33%
G 1 1 50.00% 50.00%
Y 2 0 100.00% 0
J 1 2 33.33% 66.66%
K 0 1 0 100%
How do I achieve this? Thanks for any help.
CodePudding user response:
You could group the data by col1
and summarise()
:
library(dplyr)
df %>%
group_by(col1) %>%
summarise(count_match = sum(col1 == col2),
count_nomatch = n() - count_match,
across(contains("match"), ~ .x / n() * 100, .names = "{sub('count', 'percent', .col)}"))
# # A tibble: 5 × 5
# col1 count_match count_nomatch percent_match percent_nomatch
# <chr> <int> <int> <dbl> <dbl>
# 1 A 2 1 66.7 33.3
# 2 G 1 1 50 50
# 3 J 1 2 33.3 66.7
# 4 K 0 1 0 100
# 5 Y 2 0 100 0
CodePudding user response:
You can build the summary table in a few steps:
library(tidyverse)
library(scales)
d <- structure(list(col1 = c("A", "A", "A", "G", "G", "Y", "Y", "J", "J", "J", "K"),
col2 = c("A", "A", "B", "G", "H", "Y", "Y", "P", "P", "J", "L")), class = "data.frame", row.names = c(NA, -11L))
d %>%
mutate(match = col1 == col2,
nomatch = !match) %>%
group_by(col1) %>%
summarise(count_match = sum(match),
count_nomatch = sum(nomatch)) %>%
rowwise() %>%
mutate(percent_match = count_match/sum(count_match, count_nomatch),
percent_nomatch = 1 - percent_match) %>%
mutate(across(starts_with("percent"), ~percent(.x))) %>%
ungroup()
#> # A tibble: 5 × 5
#> col1 count_match count_nomatch percent_match percent_nomatch
#> <chr> <int> <int> <chr> <chr>
#> 1 A 2 1 67% 33%
#> 2 G 1 1 50% 50%
#> 3 J 1 2 33% 67%
#> 4 K 0 1 0% 100%
#> 5 Y 2 0 100% 0%
Created on 2022-07-18 by the reprex package (v2.0.1)