Text subsetting of a data frame in R

I have two single-column data frames of given names in R:

A <- data.frame(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B  <- data.frame(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

I want to compare the two and create C, containing the names from B that are not in A. The comparison should ignore capitalization, i.e. treat James and james as the same name. When an entry holds two names (a given name and a preferred name, e.g. Lucas; Luc), a match on either name should count as a match. In the end, the result must be:

C <- data.frame(c("Evelyn; Eva", "Harper","Amelia"))

Can someone help me?

CodePudding user response：

I'm not sure what you mean by names appearing as part of a double name. However, the bulk of your question can be answered with:

setdiff() to identify the elements of one set that are missing from another (it is part of base R; dplyr also exports a version), and
the base toupper() to convert all strings to upper case (or tolower() for lower case)

We also need to name the column in each data frame. I called both "x".

A <- data.frame(x=c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- data.frame(x=c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

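library(tidyverse)
?union  # opens the help page for the set operations
# Symmetric difference: names that appear in only one of the two columns
union(setdiff(toupper(A$x), toupper(B$x)), setdiff(toupper(B$x), toupper(A$x)))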

#[1] "NICK"        "MARIA"       "OLIVER"      "SOPHIA"      "LUCAS; LUC"  "LUC"         "EVELYN; EVA" "HARPER"      "AMELIA"

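Note that setdiff() is asymmetric: setdiff(x, y) returns only the elements of x that are not in y. The call above therefore combines A minus B with B minus A. If you only want the names of B that are not in A, setdiff(toupper(B$x), toupper(A$x)) on its own is enough.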

More info here: "Unexpected behavior for setdiff() function in R".
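
As a minimal sketch of that asymmetry, using only base R and plain character vectors (the names a, b and only_in_b are mine, not from the question), a single setdiff() call in the B-minus-A direction returns the names of B that have no exact, case-insensitive match in A:

```r
# Same names as in the question, as plain character vectors
a <- c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc")
b <- c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia")

# Upper-case both sides, then keep the elements of b that are absent from a
only_in_b <- setdiff(toupper(b), toupper(a))
print(only_in_b)
# [1] "LUC"         "EVELYN; EVA" "HARPER"      "AMELIA"
```

"LUC" is still listed because "LUCAS; LUC" is compared as one whole string, so this alone does not cover the given/preferred name requirement.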

CodePudding user response：

I'm also not sure if I understood you correctly:

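library(stringr)  # provides str_to_title()

A <- str_to_title(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- str_to_title(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))
# Keep the elements of B that do not occur in A (whole-string match only)
C <- B[!B %in% A]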

CodePudding user response：

Probably the ugliest code I've written, but it works.

library(tidyverse)  # stringr, tibble and dplyr are all used below

A <- str_to_title(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- str_to_title(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

# Long version if you wish:
nested <- tibble(given = str_extract(c(A, B), "^[^;]+"),
                 preferred = str_extract(c(A, B), ";\\s*([^;]+)") %>% str_extract("[a-zA-Z]+"),
                 list = c(rep("A", length(A)), rep("B", length(B)))) %>% nest_by(list)
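A <- nested$data[[1]]
B <- nested$data[[2]]
# Despite its name, unique_b is TRUE where a given name in B matches a given or preferred name in A
unique_b <- B$given %in% A$given | B$given %in% A$preferred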

B %>%
  filter(given %in% B$given[!unique_b]) %>%
  mutate(c = ifelse(is.na(preferred), given, str_c(given, preferred, sep = "; "))) %>%
  pull(c)
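
For comparison, the same idea can be sketched in a few lines of base R. This is only one possible approach, under two assumptions: ";" is the only separator between a given and a preferred name, and a match on either part counts as a match. The helper name_parts() and the variables a, b, known, keep and result are hypothetical names introduced for this example. Unlike the code above, it also checks the preferred names in B against A.

```r
# Same names as in the question, as plain character vectors
a <- c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc")
b <- c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia")

# Split one "Given; Preferred" entry into lower-case parts without padding
name_parts <- function(x) {
  trimws(strsplit(tolower(x), ";", fixed = TRUE)[[1]])
}

# Every given or preferred name that occurs anywhere in a
known <- unlist(lapply(a, name_parts))

# Keep an entry of b only if none of its parts is already known from a
keep <- !vapply(b, function(x) any(name_parts(x) %in% known), logical(1))
result <- b[keep]
print(result)
# [1] "Evelyn; Eva" "Harper"      "Amelia"
```

The original spelling of each entry in b is preserved, because lower-casing is applied only to the copies used for comparison.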