Here is dataframe 1
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 10 1 2 3
2 A 20 4 5 6
3 B 30 7 8 9
4 A 40 10 11 12
Here is dataframe 2 with the same number of rows and columns
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 C 19 16 17 18
3 A 35 19 20 21
4 B 45 22 23 24
I want to create a new dataframe that is chosen based on the maximum value of the "value" column. The row of the new dataframe is the entire row of the dataframe that has the highest number in the "value" column.
Thus the desired solution is:
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 A 20 4 5 6
3 A 35 19 20 21
4 B 45 22 23 24
These are demo dataframes. The actual data frames are on the order of 200,000 rows. What is the best way to do this? Note, it would also be good to have a column in the new dataframe indicating which data frame the row was from: df_1 or df_2.
Dataframes
df_1 <- structure(list(card = c("A", "A", "B", "A"), value = c(10, 20,
30, 40), cat1 = c(1, 4, 7, 10), cat2 = c(2, 5, 8, 11), cat3 = c(3,
6, 9, 12)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
df_2 <- structure(list(card = c("C", "C", "A", "B"), value = c(11, 19,
35, 45), cat1 = c(13, 16, 19, 22), cat2 = c(14, 17, 20, 23),
cat3 = c(15, 18, 21, 24)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))
CodePudding user response:
base
solutions
ifelse
do.call(rbind,
ifelse(df1$value >= df2$value,
split(df1, 1:nrow(df1)),
split(df2, 1:nrow(df2)))
)
lapply
do.call(rbind, lapply(1:nrow(df1), \(x) {
if(df1$value[x] >= df2$value[x]) df1[x, ] else df2[x, ]
}))
# card value cat1 cat2 cat3
# 1 C 11 13 14 15
# 2 A 20 4 5 6
# 3 A 35 19 20 21
# 4 B 45 22 23 24
CodePudding user response:
You can probably avoid needing to do this as a grouped operation if you stack each dataset and offset the row indexes. E.g.:
sel <- max.col(cbind(df_1$value, df_2$value))
rbind(df_1, df_2)[seq_along(sel) c(0,nrow(df_1))[sel],]
## A tibble: 4 x 5
# card value cat1 cat2 cat3
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 C 11 13 14 15
#2 A 20 4 5 6
#3 A 35 19 20 21
#4 B 45 22 23 24
sel
will contain the source dataset too.
cbind(rbind(df_1, df_2)[seq_along(sel) c(0,nrow(df_1))[sel],], src=sel)
# card value cat1 cat2 cat3 src
#1 C 11 13 14 15 2
#2 A 20 4 5 6 1
#3 A 35 19 20 21 2
#4 B 45 22 23 24 2