I want to get the mean of a value (amount) over the first 20%, second 20%, third 20%, etc. of the rows in each group within a data frame.
I have started a reprex below but don't know how to finish it. I think I will need to calculate the start and end of each 20% using a row number, which I've added as the ID is not sequential or predictable.
The actual data will have more than 2 groups, and there will be more than one value column to average.
library(tidyverse)
df <- data.frame(id = c(0, 12, 23, 34, 45, 56, 67, 78, 89, 90, 9, 98, 87, 76, 65, 54, 43, 32, 21, 99),
group = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2),
amount = c(102, 312, 567, 432, 345, 987, 94, 34, 54, 32, 92, 67, 32, 49, 36, 31, 19, 18, 17, 308))
groupSplitMeans <- df %>%
mutate(rowNumber = row_number()) %>%
group_by(group)
CodePudding user response:
You could use ntile() to break the data into n groups.
df %>%
group_by(group) %>%
mutate(nth = ntile(n = 5)) %>%
group_by(group, nth) %>%
summarise(amount = mean(amount)) %>%
ungroup()
# # A tibble: 10 × 3
# group nth amount
# <dbl> <int> <dbl>
# 1 1 1 207
# 2 1 2 540.
# 3 1 3 32
# 4 1 4 34
# 5 1 5 17.5
# 6 2 1 500.
# 7 2 2 190.
# 8 2 3 73
# 9 2 4 51.5
# 10 2 5 170.
The code above splits the data by row order within each group, regardless of the value of amount. If the rows need to be ranked by amount before splitting, pass amount to ntile():
df %>%
...
mutate(nth = ntile(amount, n = 5)) %>%
...
# # A tibble: 10 × 3
# group nth amount
# <dbl> <int> <dbl>
# 1 1 1 17.5
# 2 1 2 25.5
# 3 1 3 40.5
# 4 1 4 98
# 5 1 5 650.
# 6 2 1 32.5
# 7 2 2 45
# 8 2 3 79.5
# 9 2 4 326.
# 10 2 5 500.
Now you can see that the means of amount increase with nth.
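Since the question mentions averaging more than one value column, here is a minimal sketch of the full pipeline with the elided steps written out and across() used to average several columns at once. The column amount2 is hypothetical and added only for illustration, and across() and the .groups argument assume dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  id = c(0, 12, 23, 34, 45, 56, 67, 78, 89, 90, 9, 98, 87, 76, 65, 54, 43, 32, 21, 99),
  group = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2),
  amount = c(102, 312, 567, 432, 345, 987, 94, 34, 54, 32, 92, 67, 32, 49, 36, 31, 19, 18, 17, 308)
)

# Hypothetical second value column, present only to show multi-column means.
df$amount2 <- df$amount * 2

groupSplitMeans <- df %>%
  group_by(group) %>%
  # Assign each row to one of 5 buckets, ranked by amount within its group.
  mutate(nth = ntile(amount, n = 5)) %>%
  group_by(group, nth) %>%
  # Average every listed column per group and bucket, then drop the grouping.
  summarise(across(c(amount, amount2), mean), .groups = "drop")

print(groupSplitMeans)
```

To split by row order instead of by amount, use ntile(n = 5) in the mutate() step, as in the first example.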