How can we return the number of common characters in two strings in R?

Time:03-23

I want to search two character strings and return the number of common characters. So, if we have s1 = "aabcc" and s2 = "adcaa", the output should be solution(s1, s2) = 3. (s1 and s2 have 3 common characters - 2 "a"s and 1 "c".)

My idea was to concatenate the two strings using paste() and then count each distinct character in the combined string. If a character's count is even, I would add half of it to a running total (so four a's make two pairs); if the count is odd, I would subtract one and then add half of the result (effectively disregarding the extra occurrence that cannot be paired).

I thought maybe I could do this by getting our characters in a data.frame which recorded the count of each letter, but the code to do that is getting inordinately long:

library(dplyr)
library(tidyr)

df <- as.data.frame(paste(s1, s2, sep = "")) %>%
  ## keep first column only and name it 'characters':
  select('characters' = 1) %>%
  ## multiple cell values (as separated by a blank)
  ## into separate rows:
  separate_rows(characters, sep = " ") %>%
  group_by(characters) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

So I'm now thinking that I've overcomplicated this whole thing. Can anyone point me in the right direction? Is my initial idea sensible, or is it off the mark?

Clarification: Strings are not necessarily the same length, but they are both always between 1 and 14 characters long.

Clarification 2: Ideally, solutions will be in base R (no packages), because that is what I'm trying to get competent in first, but all other solutions are still welcome.

CodePudding user response:

Here's an approach that first splits each string into characters with strsplit(), then uses vecsets::vintersect() to get the intersecting characters (unlike base intersect(), repeated characters are kept), and finally takes the length of the result.

This also works for strings of different lengths.

library(vecsets)

length(vintersect(strsplit(s1, "")[[1]], strsplit(s2, "")[[1]]))
[1] 3
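
Since the question asks for base R if possible, here is a minimal sketch of the same multiset-intersection idea without the vecsets dependency. The helper name common_chars() is made up for illustration and is not part of any package; it counts each shared character as many times as it appears in both strings.

```r
# Hypothetical helper (not from vecsets): number of characters two strings
# have in common, respecting multiplicity, using base R only.
common_chars <- function(s1, s2) {
  x <- strsplit(s1, "")[[1]]
  y <- strsplit(s2, "")[[1]]
  chars <- intersect(x, y)  # distinct characters present in both strings
  # Tabulate both strings over the same levels so the counts line up.
  cx <- as.vector(table(factor(x, levels = chars)))
  cy <- as.vector(table(factor(y, levels = chars)))
  # Each shared character contributes the smaller of its two counts.
  sum(pmin(cx, cy))
}

common_chars("aabcc", "adcaa")
```

For the example strings this counts two "a"s and one "c".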

CodePudding user response:

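With dplyr:

library(dplyr)

# count each character per string, keep characters present in both,
# and sum the smaller of the two counts
inner_join(as.data.frame(table(strsplit(s1, ""))),
           as.data.frame(table(strsplit(s2, ""))),
           by = "Var1") %>%
  mutate(Freq.diff = pmin(Freq.x, Freq.y)) %>%
  pull(Freq.diff) %>%
  sum()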

Or with just base R:

df <- merge(as.data.frame(table(strsplit(s1, ""))),
            as.data.frame(table(strsplit(s2, ""))),
            by = 1) 

sum(pmin(df$Freq.x, df$Freq.y))
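
As for the pairing idea in the question: counting pairs in the concatenated string can overcount, because two copies of a character from the same string also form a pair. The sketch below compares the two approaches; pair_count() is a hypothetical helper written only to illustrate that idea, and solution() simply wraps the merge-based code above under the name used in the question.

```r
# Hypothetical helper: the pairing idea from the question, i.e. add
# floor(n / 2) for every character of the concatenated string.
pair_count <- function(s1, s2) {
  counts <- table(strsplit(paste0(s1, s2), "")[[1]])
  sum(as.vector(counts) %/% 2)
}

# The merge-based approach, wrapped as solution() as in the question.
solution <- function(s1, s2) {
  df <- merge(as.data.frame(table(strsplit(s1, ""))),
              as.data.frame(table(strsplit(s2, ""))),
              by = 1)
  sum(pmin(df$Freq.x, df$Freq.y))
}

pair_count("aabcc", "adcaa")  # agrees with solution() on this input
pair_count("aa", "b")         # counts one pair although nothing is shared
solution("aa", "b")
```

The two functions agree on the example from the question, but for "aa" and "b" the pairing count is 1 while the strings have no characters in common, so taking the per-character minimum of the two counts is the safer rule.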

CodePudding user response:

Another possible solution:

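library(tidyverse)

s1 <- "aabcc"
s2 <- "adcaa"

data.frame(x = table(str_split(s1, "", simplify = TRUE)[1, ])) %>%
  inner_join(data.frame(x = table(str_split(s2, "", simplify = TRUE)[1, ])), by = "x.Var1") %>%
  # take the minimum over the two count columns only; a row-wise apply()
  # would coerce the row to character and mis-order counts of 10 or more
  summarise(n = sum(pmin(x.Freq.x, x.Freq.y))) %>%
  pull(n)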

#> [1] 3