I have a data frame. Each row is a separate person. I need to create a data frame that only shows the latest "date" and "salary" per row. Below is an example of the data frame I'm starting with:
example_df <- tribble(
~person_id, ~date1, ~date2, ~date3, ~salaary1, ~salary2, ~salary3,
1, 2010, 2013, 2015, 100, 200, 300,
2, 1998, NA, NA, 50, NA, NA,
3, 2000, 2001, NA, 100, 200, NA,
4, 1987, 1989, 2005, 50, 300, 500
)
This is what I need the data frame to look like after processing:
example_clean_df <- tribble(
~person_id, ~date, ~salaary,
1, 2015,300,
2, 1998, 50,
3, 2001, 200,
4, 2005, 500
)
Any ideas would be super helpful. Thank you!
CodePudding user response:
Does this work:
library(dplyr)
example_df %>%
rowwise() %>%
mutate(date = max(date1, date2, date3, na.rm = 1),
salary = max(salaary1, salary2, salary3, na.rm = 1)) %>%
select(person_id, date, salary)
# A tibble: 4 × 3
# Rowwise:
person_id date salary
<dbl> <dbl> <dbl>
1 1 2015 300
2 2 1998 50
3 3 2001 200
4 4 2005 500
CodePudding user response:
Use pivot_longer
and slice_max
:
library(dplyr)
library(tidyr)
example_df %>%
pivot_longer(-person_id, names_pattern = "(date|salary)(\\d)", names_to = c(".value", "number")) %>%
group_by(person_id) %>%
slice_max(salary) %>%
select(-number)
output
# A tibble: 4 × 3
# Groups: person_id [4]
person_id date salary
<dbl> <dbl> <dbl>
1 1 2015 300
2 2 1998 50
3 3 2001 200
4 4 2005 500
CodePudding user response:
The primitive base r / for loop version:
result.df <- list()
for(i in seq(nrow(example_df))){
result.df[[i]] <- cbind(example_df[1][i,],
max(example_df[,2:4][i,], na.rm = T),
max(example_df[,5:7][i,], na.rm = T))
}
result.df <- setNames(do.call(rbind, result.df), c('person_id', 'date', 'salary'))
person_id date salary
1 1 2015 300
2 2 1998 50
3 3 2001 200
4 4 2005 500
PS:
microbenchmark suggests using @Karthik S method, as it is the fastest.
# test1: loop base R
# test2: rowwise mutate
# test3: pivot_longer
min lq mean median uq max neval
9.9957 10.41815 11.858530 10.86845 11.97645 21.8334 100
7.6594 7.96195 9.457389 8.29315 9.49365 25.9524 100
12.0949 12.49685 14.080567 12.83050 13.85300 26.6272 100
PPS:
using lapply speeds up the process:
result.df <- lapply(seq(nrow(example_df)), \(x) cbind(example_df[1][x,],
max(example_df[,2:4][x,], na.rm = T),
max(example_df[,5:7][x,], na.rm = T))) %>%
do.call(rbind, .) %>%
setNames(c('person_id', 'date', 'salary'))
min lq mean median uq max neval
3.6828 3.89075 5.754244 4.41195 7.14130 14.0325 100