Info: So I have 2 dataframes, we'll call them dfa and dfb. Dfa has a master list of things I'm interested in, and dfb has a list of things I want to check for in dfa if they already exist. I'm using a left join to carry over some information in a column from dfb (Note) to dfa based on whether or not the items in dfa existed in b already.
Goal: I want to create a column that says something like "True" or a numerical value if it is not included already in dfa. Or a value that denotes it exists in both files, either way is fine.
Use of dplyr preferred.
Example:
dfa <- data.frame(
ID = c(11,42,21,3,4),
Name = c("ab", "bc", "cd", "de","fg")
)
dfb <- data.frame(
ID = c(11,32,11,3),
Name = c("ab", "bb", "fd", "de"),
Note = c("blue","white","black","yellow")
)
join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
mutate( new = case_when(dfa$ID %in% dfb$ID ~ "exists") )
This just makes a "new" col with NAs all through it. I tried another method:
join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
mutate( new = case_when(dfa$ID == dfb$ID ~ "exists") )
This gave me an error about the length of the dfs because in real life the two dfs are not the same length long (rownum).
join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
mutate( new = case_when(dfa$ID !%in% dfb$ID ~ "is new") )
Also gives me issues about df length.
Desired output would look like:
join:
ID Name new
11 "ab" "exists"
42 "bc"
21 "cd"
3 "de" "exists"
4 "fg"
32 "fd"
11 "de"
Or just 'is new' in the two bottom new columns. Same difference, help would be appreciated. Thanks!
CodePudding user response:
IIUC:
dfa %>% left_join(dfb, by=c('ID', 'Name')) %>% mutate(new= if_else(is.na(Note), '', 'exists'))
output:
ID Name Note new
1 11 ab blue exists
2 42 bc <NA>
3 21 cd <NA>
4 3 de yellow exists
5 4 fg <NA>
This is "a value that denotes it exists in both files", but doesn't look like what you added as expected output, so I'm not sure if I misunderstood something, because in your example I see everything from dfa flagged as "exists"
CodePudding user response:
library(dplyr)
union(dfa,dfb |> select(Name, ID)) |>
left_join(mutate(intersect(dfa, dfb |> select(names(dfa))), new = 1), by = c('Name', 'ID')) |>
mutate(new = if_else(is.na(new), 0, new))
# ID Name new
# 1 11 ab 1
# 2 42 bc 0
# 3 21 cd 0
# 4 3 de 1
# 5 4 fg 0
# 6 32 bb 0
# 7 11 fd 0