Home > front end >  Using group by to get the value corresponding to the max value of another column
Using group by to get the value corresponding to the max value of another column

Time:12-01

In my dataset respondents are grouped together and there is data available about their age. I want all the people in the same group to have the value of the oldest person in that group.

So my example data looks like this.

      df <- data.frame(groups = c(1,1,1,2,2,2,3,3,3), 
                       age = c(12, 23, 34, 13, 24, 35, 13, 25, 36), 
                       value = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

> df
  groups age value
1      1  12     1
2      1  23     2
3      1  34     3
4      2  13     4
5      2  24     5
6      2  35     6
7      3  13     7
8      3  25     8
9      3  36     9

And I want it to look this this

> df
  groups age value new_value
1      1  12     1         3
2      1  23     2         3
3      1  34     3         3
4      2  13     4         6
5      2  24     5         6
6      2  35     6         6
7      3  13     7         9
8      3  25     8         9
9      3  36     9         9

Any idea how to do this with dplyr?

I have tried something like this, but it doesn't work

df %>% 
        group_by(groups) %>% 
        mutate(new_value = df$value[which.max(df$age)])

CodePudding user response:

Up front, "never" (okay, almost never) use df$ within a dplyr pipe. In this case, df$value[which.max(df$age)] is referencing the original data each time, not the grouped data. Inside each group in this dataset, value is length 3 whereas df$value is length 9.

The only times I feel it is appropriate to use df$ (referencing the original value of the current dataset) inside a pipe is when it is required to look at pre-pipeline data, in absence of any grouping, reordering, or new variables created outside of the currently-saved (pre-pipeline) version of df.

dplyr

library(dplyr)
df %>%
  group_by(groups) %>%
  mutate(new_value = value[which.max(age)]) %>%
  ungroup()
# # A tibble: 9 x 4
#   groups   age value new_value
#    <dbl> <dbl> <dbl>     <dbl>
# 1      1    12     1         3
# 2      1    23     2         3
# 3      1    34     3         3
# 4      2    13     4         6
# 5      2    24     5         6
# 6      2    35     6         6
# 7      3    13     7         9
# 8      3    25     8         9
# 9      3    36     9         9

data.table

library(data.table)
DT <- as.data.table(df)
DT[, new_value := value[which.max(age)], by = .(groups)]

base R

df$new_value <- ave(seq_len(nrow(df)), df$groups,
                    FUN = function(i) df$value[i][which.max(df$age[i])])
df
#   groups age value new_value
# 1      1  12     1         3
# 2      1  23     2         3
# 3      1  34     3         3
# 4      2  13     4         6
# 5      2  24     5         6
# 6      2  35     6         6
# 7      3  13     7         9
# 8      3  25     8         9
# 9      3  36     9         9

The base R approach seems to be the least-elegant-looking solution. I believe that ave is the best approach, but it has many limitations, first being that it only works on one value-object (value) in the absence of others (we need to know age).

  • Related