I have two dataframes
: df1
has 14k rows with missing rows and df2
has 100k rows of complete rows.
df1:
combo main main_cost rating
1 burger_coke burger 8 great
2 burger_fries burger 7 fine
3 steak_cheese steak okay
4 steak_mash steak awesome
5 salad_dressing salad
6 salad_fruit salad 5 great
7 pizza_rolls pizza 3 fine
8 pizza_bagels pizza 13
9 pizza_veg pizza great
df2:
have_it main main_cost rating distance(mi)
1 TRUE burger 8 great 20
2 FALSE burger 7 fine 56
3 TRUE steak 4 okay 30
4 FALSE steak 7 awesome 19
5 TRUE salad 10 decent 78
6 FALSE salad 5 great 66
7 FALSE pizza 3 fine 12
8 TRUE pizza 13 bad 14
9 FALSE pizza 11 great 40
I would like to merge the two into the smaller dataframe as the "combo" column is the key I am using. I only want to append the new columns and fill in the rows where there were missing values. I believe the join would be on the "main" column as it is the only column where nearly every value in df2
exists in df1
. Ideally, I would like to end up with something like this:
df3:
combo main main_cost rating distance(mi) have_it
1 burger_coke burger 8 great 20 TRUE
2 burger_fries burger 7 fine 56 FALSE
3 steak_cheese steak 4 okay 30 TRUE
4 steak_mash steak 7 awesome 19 FALSE
5 salad_dressing salad 10 decent 78 TRUE
6 salad_fruit salad 5 great 66 FALSE
7 pizza_rolls pizza 3 fine 12 FALSE
8 pizza_bagels pizza 13 bad 14 TRUE
9 pizza_veg pizza 11 great 40 FALSE
I have tried all kinds of merges and joins but keep ending up with a huge df containing NA's where I want the filled in values.
CodePudding user response:
You may try to impute missing values in df1
with adjacent non-missings of df2
. Then just merge
, where "main"
, "main_cost"
, and "rating"
columns will automatically be selected. Just "main"
would be insufficient, because there are ties.
df1[3:4] <- lapply(names(df2)[3:4], \(z)
mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))
(res <- merge(df1, df2))
# main main_cost rating combo have_it distance_mi
# 1 burger 7 fine burger_fries FALSE 56
# 2 burger 8 great burger_coke TRUE 20
# 3 pizza 11 great pizza_veg FALSE 40
# 4 pizza 13 bad pizza_bagels TRUE 14
# 5 pizza 3 fine pizza_rolls FALSE 12
# 6 salad 10 decent salad_dressing TRUE 78
# 7 salad 5 great salad_fruit FALSE 66
# 8 steak 4 okay steak_cheese TRUE 30
# 9 steak 7 awesome steak_mash FALSE 19
Note, that this probably only works if the data frames are of same size and row order, and values are successfully imputed so that the merging columns become identical. If NA's are left, say in the "rating"
column, try to explicitly specify the merging columns using e.g. by=c("main", "main_cost")
where you will end up with "rating.x"
and "rating.y"
, though.
Data:
df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese",
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))