Home > Software engineering >  How to choose column with the largest number by row
How to choose column with the largest number by row

Time:09-15

I have a data frame. Each row is a separate person. I need to create a data frame that only shows the latest "date" and "salary" per row. Below is an example of the data frame I'm starting with:

example_df <- tribble(
  ~person_id, ~date1, ~date2, ~date3, ~salaary1, ~salary2, ~salary3,
  1, 2010, 2013, 2015, 100, 200, 300,
  2, 1998, NA, NA, 50, NA, NA, 
  3, 2000, 2001, NA, 100, 200, NA, 
  4, 1987, 1989, 2005, 50, 300, 500
)

This is what I need the data frame to look like after processing:

example_clean_df <- tribble(
  ~person_id, ~date, ~salaary,
  1, 2015,300,
  2, 1998, 50,
  3,  2001, 200,
  4, 2005, 500
) 

Any ideas would be super helpful. Thank you!

CodePudding user response:

Does this work:

library(dplyr)

example_df %>% 
        rowwise() %>% 
                mutate(date = max(date1, date2, date3, na.rm = 1),
                                  salary = max(salaary1, salary2, salary3, na.rm = 1)) %>% 
                                              select(person_id, date, salary)
# A tibble: 4 × 3
# Rowwise: 
  person_id  date salary
      <dbl> <dbl>  <dbl>
1         1  2015    300
2         2  1998     50
3         3  2001    200
4         4  2005    500

CodePudding user response:

Use pivot_longer and slice_max:

library(dplyr)
library(tidyr)
example_df %>% 
  pivot_longer(-person_id, names_pattern = "(date|salary)(\\d)", names_to = c(".value", "number")) %>% 
  group_by(person_id) %>% 
  slice_max(salary) %>% 
  select(-number)

output

# A tibble: 4 × 3
# Groups:   person_id [4]
  person_id  date salary
      <dbl> <dbl>  <dbl>
1         1  2015    300
2         2  1998     50
3         3  2001    200
4         4  2005    500

CodePudding user response:

The primitive base r / for loop version:

result.df <- list()

for(i in seq(nrow(example_df))){
  result.df[[i]] <- cbind(example_df[1][i,], 
                         max(example_df[,2:4][i,], na.rm = T), 
                         max(example_df[,5:7][i,], na.rm = T))
}

result.df <- setNames(do.call(rbind, result.df), c('person_id', 'date', 'salary'))

  person_id date salary
1         1 2015    300
2         2 1998     50
3         3 2001    200
4         4 2005    500

PS:

microbenchmark suggests using @Karthik S method, as it is the fastest.

# test1: loop base R
# test2: rowwise mutate
# test3: pivot_longer

     min       lq      mean   median       uq     max neval
  9.9957 10.41815 11.858530 10.86845 11.97645 21.8334   100
  7.6594  7.96195  9.457389  8.29315  9.49365 25.9524   100
 12.0949 12.49685 14.080567 12.83050 13.85300 26.6272   100

PPS:

using lapply speeds up the process:

result.df <- lapply(seq(nrow(example_df)), \(x) cbind(example_df[1][x,], 
                                                          max(example_df[,2:4][x,], na.rm = T), 
                                                          max(example_df[,5:7][x,], na.rm = T))) %>% 
      do.call(rbind, .) %>% 
      setNames(c('person_id', 'date', 'salary'))

     min       lq      mean   median       uq     max neval
  3.6828  3.89075  5.754244  4.41195  7.14130 14.0325   100
  •  Tags:  
  • r
  • Related