Home > Software engineering >  How to merge two data frames with missing values?
How to merge two data frames with missing values?

Time:03-22

I have two dataframes: df1 has 14k rows with missing rows and df2 has 100k rows of complete rows.

df1:
     combo           main     main_cost   rating     
1    burger_coke     burger      8         great
2    burger_fries    burger      7         fine
3    steak_cheese    steak                 okay
4    steak_mash      steak                 awesome
5    salad_dressing  salad                 
6    salad_fruit     salad       5         great
7    pizza_rolls     pizza       3         fine
8    pizza_bagels    pizza       13      
9    pizza_veg       pizza                 great



df2:
     have_it    main     main_cost    rating     distance(mi)
1    TRUE       burger      8         great      20
2    FALSE      burger      7         fine       56
3    TRUE       steak       4         okay       30
4    FALSE      steak       7         awesome    19
5    TRUE       salad       10        decent     78 
6    FALSE      salad       5         great      66
7    FALSE      pizza       3         fine       12
8    TRUE       pizza       13        bad        14
9    FALSE      pizza       11        great      40

I would like to merge the two into the smaller dataframe as the "combo" column is the key I am using. I only want to append the new columns and fill in the rows where there were missing values. I believe the join would be on the "main" column as it is the only column where nearly every value in df2 exists in df1. Ideally, I would like to end up with something like this:

df3:
     combo           main     main_cost    rating      distance(mi)   have_it  
1    burger_coke     burger      8         great       20             TRUE
2    burger_fries    burger      7         fine        56             FALSE
3    steak_cheese    steak       4         okay        30             TRUE
4    steak_mash      steak       7         awesome     19             FALSE
5    salad_dressing  salad       10        decent      78             TRUE
6    salad_fruit     salad       5         great       66             FALSE
7    pizza_rolls     pizza       3         fine        12             FALSE
8    pizza_bagels    pizza       13        bad         14             TRUE
9    pizza_veg       pizza       11        great       40             FALSE

I have tried all kinds of merges and joins but keep ending up with a huge df containing NA's where I want the filled in values.

CodePudding user response:

You may try to impute missing values in df1 with adjacent non-missings of df2. Then just merge, where "main", "main_cost", and "rating" columns will automatically be selected. Just "main" would be insufficient, because there are ties.

df1[3:4] <- lapply(names(df2)[3:4], \(z) 
                   mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
#     main main_cost  rating          combo have_it distance_mi
# 1 burger         7    fine   burger_fries   FALSE          56
# 2 burger         8   great    burger_coke    TRUE          20
# 3  pizza        11   great      pizza_veg   FALSE          40
# 4  pizza        13     bad   pizza_bagels    TRUE          14
# 5  pizza         3    fine    pizza_rolls   FALSE          12
# 6  salad        10  decent salad_dressing    TRUE          78
# 7  salad         5   great    salad_fruit   FALSE          66
# 8  steak         4    okay   steak_cheese    TRUE          30
# 9  steak         7 awesome     steak_mash   FALSE          19

Note, that this probably only works if the data frames are of same size and row order, and values are successfully imputed so that the merging columns become identical. If NA's are left, say in the "rating" column, try to explicitly specify the merging columns using e.g. by=c("main", "main_cost") where you will end up with "rating.x" and "rating.y", though.


Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls", 
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak", 
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L, 
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine", 
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak", 
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L, 
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine", 
"okay", "awesome", "decent", "great", "fine", "bad", "great"), 
    distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
    )), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9"))
  • Related