I have a dataframe like the one below (though much larger).
name | age | sex | favcolor | grade | score |
---|---|---|---|---|---|
tim | NA | NA | blue | 12 | 100 |
tim | 18 | male | red | 12 | 50 |
dave | 17 | male | red | 12 | 85 |
mike | 15 | male | green | 10 | 95 |
john | 12 | male | NA | 7 | 80 |
john | 12 | NA | orange | 7 | 90 |
There are a few things I want to do. My primary goal is to merge the rows by the name variable, such that each name gets one row. Second, when merging rows, I want cells with data to override cells that are NA (tim with age
and sex
. Third, if the rows that are merging both have text values (e.g. tim with favcolor
), I want to keep the first one. And lastly, for rows that both have values in numeric columns (age
, grade
, and score
), I want the new value to be the mean of the merging rows.
If all these rules are followed, the dataframe should look something like this.
name | age | sex | favcolor | grade | score |
---|---|---|---|---|---|
tim | 18 | male | blue | 12 | 75 |
dave | 17 | male | red | 12 | 85 |
mike | 15 | male | green | 10 | 95 |
john | 12 | male | orange | 7 | 85 |
Is there a straightforward way to accomplish this? I've tried about 30 different things, but it never turns out the way I want it to. Any help would be greatly appreciated.
CodePudding user response:
You can group_by(name)
and use summarize
to collapse the rows into a single one.
Finally, relocate
to reorder the columns as the input.
library(dplyr)
df %>%
group_by(name) %>%
summarize(across(where(is.numeric), ~ mean(.x, na.rm = T)),
across(where(is.character), ~.x[!is.na(.x)][1])) %>%
relocate(colnames(df))
# A tibble: 4 × 6
name age sex favcolor grade score
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 dave 17 male red 12 85
2 john 12 male orange 7 85
3 mike 15 male green 10 95
4 tim 18 male blue 12 75