I have the following data
df <- data.frame(
  group = c('r1','r2','r3','r4'),
  X1 = c('A','B','C','K'),
  X2 = c('A','C','M','K'),
  X3 = c('D','A','C','K')
)
> df
group X1 X2 X3
1 r1 A A D
2 r2 B C A
3 r3 C M C
4 r4 K K K
I want to estimate a 'similarity score' based on columns X1, X2 & X3. For example, within group r1 (row 1), 2 out of 3 elements are identical, so the score is 2/3 (~67%). For group r4 (row 4), all 3 elements are identical, so the score would be 3/3 (100%). The desired outcome is below:
> df
group X1 X2 X3 similarity_score
1 r1 A A D .67
2 r2 B C A .33
3 r3 C M C .67
4 r4 K K K 1
How can I achieve this?
CodePudding user response:
You could do
df$similarity <- round(apply(df[-1], 1, function(x) max(table(x))/length(x)), 2)
df
#> group X1 X2 X3 similarity
#> 1 r1 A A D 0.67
#> 2 r2 B C A 0.33
#> 3 r3 C M C 0.67
#> 4 r4 K K K 1.00
Created on 2022-04-18 by the reprex package (v2.0.1)
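To see why this works, here is a minimal sketch of what the anonymous function computes for a single row, using the values of row r1 from the question. The variable names x, counts and score are only illustrative.

```r
# Values of X1, X2, X3 for row r1 in the question
x <- c("A", "A", "D")

# Frequency of each distinct value: A appears twice, D once
counts <- table(x)

# Share of the row taken up by the most frequent value
score <- max(counts) / length(x)
round(score, 2)
```

One thing to keep in mind: table() drops NA values by default while length() still counts them, so rows with missing values would probably need separate handling.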
CodePudding user response:
A tidyverse solution:
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(
    similarity_score = max(colMeans(outer(c_across(-group), c_across(-group), `==`)))
  )
Or, instead of c_across, you could use a nest solution:
df %>%
  group_by(group) %>%
  nest(data = -group) %>%
  rowwise() %>%
  mutate(
    similarity_score = max(colMeans(outer(unlist(data), unlist(data), `==`)))
  ) %>%
  unnest(data)
group X1 X2 X3 similarity_score
<chr> <chr> <chr> <chr> <dbl>
1 r1 A A D 0.667
2 r2 B C A 0.333
3 r3 C M C 0.667
4 r4 K K K 1
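The outer() idea may be easier to follow on a single row. The sketch below uses the values of row r1 from the question; the names x, eq and score are only illustrative.

```r
# Values of X1, X2, X3 for row r1 in the question
x <- c("A", "A", "D")

# 3 x 3 logical matrix: eq[i, j] is TRUE when x[i] equals x[j]
eq <- outer(x, x, `==`)

# Each column mean is the share of elements equal to that column's value
colMeans(eq)

# The largest share is the similarity score for the row
score <- max(colMeans(eq))
score
```

Since every element is compared with every other, this does more work per row than counting with table(), but for three columns the difference should not matter.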
CodePudding user response:
Another possible solution:
library(dplyr)
df %>%
rowwise %>%
mutate(score = max(prop.table(table(c_across(X1:X3))))) %>%
ungroup
#> # A tibble: 4 × 5
#> group X1 X2 X3 score
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 r1 A A D 0.667
#> 2 r2 B C A 0.333
#> 3 r3 C M C 0.667
#> 4 r4 K K K 1
Or even shorter:
library(tidyverse)
df %>% mutate(score = pmap_dbl(across(X1:X3), ~ max(prop.table(table(c(...))))))