Home > Software engineering >  Grab row of dataframe based max value across two data frames
Grab row of dataframe based max value across two data frames

Time:09-15

Here is dataframe 1

  card  value  cat1  cat2  cat3
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A        10     1     2     3
2 A        20     4     5     6
3 B        30     7     8     9
4 A        40    10    11    12

Here is dataframe 2 with the same number of rows and columns

  card  value  cat1  cat2  cat3
  <chr> <dbl> <dbl> <dbl> <dbl>
1 C        11    13    14    15
2 C        19    16    17    18
3 A        35    19    20    21
4 B        45    22    23    24

I want to create a new dataframe that is chosen based on the maximum value of the "value" column. The row of the new dataframe is the entire row of the dataframe that has the highest number in the "value" column.

Thus the desired solution is:

  <chr> <dbl> <dbl> <dbl> <dbl>
1 C        11    13    14    15
2 A        20     4     5     6
3 A        35    19    20    21
4 B        45    22    23    24

These are demo dataframes. The actual data frames are on the order of 200,000 rows. What is the best way to do this? Note, it would also be good to have a column in the new dataframe indicating which data frame the row was from: df_1 or df_2.

Dataframes

df_1 <- structure(list(card = c("A", "A", "B", "A"), value = c(10, 20, 
30, 40), cat1 = c(1, 4, 7, 10), cat2 = c(2, 5, 8, 11), cat3 = c(3, 
6, 9, 12)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))

df_2 <- structure(list(card = c("C", "C", "A", "B"), value = c(11, 19, 
35, 45), cat1 = c(13, 16, 19, 22), cat2 = c(14, 17, 20, 23), 
    cat3 = c(15, 18, 21, 24)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))

CodePudding user response:

base solutions

  1. ifelse
do.call(rbind,
  ifelse(df1$value >= df2$value,
         split(df1, 1:nrow(df1)),
         split(df2, 1:nrow(df2)))
)
  1. lapply
do.call(rbind, lapply(1:nrow(df1), \(x) {
  if(df1$value[x] >= df2$value[x]) df1[x, ] else df2[x, ]
}))

#   card value cat1 cat2 cat3
# 1    C    11   13   14   15
# 2    A    20    4    5    6
# 3    A    35   19   20   21
# 4    B    45   22   23   24

CodePudding user response:

You can probably avoid needing to do this as a grouped operation if you stack each dataset and offset the row indexes. E.g.:

sel <- max.col(cbind(df_1$value, df_2$value))
rbind(df_1, df_2)[seq_along(sel)   c(0,nrow(df_1))[sel],]

## A tibble: 4 x 5
#  card  value  cat1  cat2  cat3
#  <chr> <dbl> <dbl> <dbl> <dbl>
#1 C        11    13    14    15
#2 A        20     4     5     6
#3 A        35    19    20    21
#4 B        45    22    23    24

sel will contain the source dataset too.

cbind(rbind(df_1, df_2)[seq_along(sel)   c(0,nrow(df_1))[sel],], src=sel)

#  card value cat1 cat2 cat3 src
#1    C    11   13   14   15   2
#2    A    20    4    5    6   1
#3    A    35   19   20   21   2
#4    B    45   22   23   24   2
  • Related