Home > Software design >  Add a row-wise specific column result from left join of two dataframes
Add a row-wise specific column result from left join of two dataframes

Time:09-27

Info: So I have 2 dataframes, we'll call them dfa and dfb. Dfa has a master list of things I'm interested in, and dfb has a list of things I want to check for in dfa if they already exist. I'm using a left join to carry over some information in a column from dfb (Note) to dfa based on whether or not the items in dfa existed in b already.

Goal: I want to create a column that says something like "True" or a numerical value if it is not included already in dfa. Or a value that denotes it exists in both files, either way is fine.

Use of dplyr preferred.

Example:

dfa <- data.frame(
ID = c(11,42,21,3,4),
Name = c("ab", "bc", "cd", "de","fg")
)
dfb <- data.frame(
ID = c(11,32,11,3),
Name = c("ab", "bb", "fd", "de"),
Note = c("blue","white","black","yellow")
)
join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
  mutate( new = case_when(dfa$ID %in% dfb$ID ~ "exists") )

This just makes a "new" col with NAs all through it. I tried another method:

join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
  mutate( new = case_when(dfa$ID == dfb$ID ~ "exists") )

This gave me an error about the length of the dfs because in real life the two dfs are not the same length long (rownum).

join <- left_join(dfa, dfb, by = c("ID","Name")) %>%
  mutate( new = case_when(dfa$ID !%in% dfb$ID ~ "is new") )

Also gives me issues about df length.

Desired output would look like:

join:
ID Name new
11 "ab" "exists"
42 "bc"  
21 "cd" 
3 "de" "exists"
4 "fg" 
32 "fd" 
11 "de"

Or just 'is new' in the two bottom new columns. Same difference, help would be appreciated. Thanks!

CodePudding user response:

IIUC:

dfa %>% left_join(dfb, by=c('ID', 'Name')) %>% mutate(new= if_else(is.na(Note), '', 'exists'))

output:

  ID Name   Note new
1 11   ab   blue exists
2 42   bc   <NA>       
3 21   cd   <NA>       
4  3   de yellow exists
5  4   fg   <NA> 

This is "a value that denotes it exists in both files", but doesn't look like what you added as expected output, so I'm not sure if I misunderstood something, because in your example I see everything from dfa flagged as "exists"

CodePudding user response:

library(dplyr)

union(dfa,dfb |> select(Name, ID)) |> 
  left_join(mutate(intersect(dfa, dfb |> select(names(dfa))), new = 1), by = c('Name', 'ID')) |> 
  mutate(new = if_else(is.na(new), 0, new))

#   ID Name new
# 1 11   ab   1
# 2 42   bc   0
# 3 21   cd   0
# 4  3   de   1
# 5  4   fg   0
# 6 32   bb   0
# 7 11   fd   0
  • Related