I have a data set that looks something like this
Age | Sex | Total |
---|---|---|
0-4 | Female | 2 |
0-4 | Male | 23 |
5-9 | Female | 43 |
5-9 | Male | 20 |
10-14 | Female | 36 |
10-14 | Male | 15 |
... | ... | ... |
... | ... | ... |
85 | Female | 230 |
85 | Male | 150 |
I want to group the rows so that the age bands consist of 0-17; 18-64; 65 . So the data frame would look something like...
Age | Sex | Total |
---|---|---|
0-17 | Female | 23 |
0-17 | Male | 27 |
18-64 | Female | 549 |
18-64 | Male | 437 |
65 | Female | 871 |
65 | Male | 789 |
Because there is no obvious criteria for grouping I cannot use the regular functions for grouping rows such as the group_by function.
Any suggestions welcome - Thanks!
CodePudding user response:
You may use strsplit
on "-"
in a by
and subset to second element with bracket function in sapply
. The '65 ' elements getr special care with gsub
and ad NA
. Then cut
at desired breaks, finally rbind
. Here a small example:
dat
# a s v
# 1 0-10 F 49
# 2 11-20 F 65
# 3 21-30 F 25
# 4 31-40 F 74
# 5 41-50 F 100
# 6 51-65 F 18
# 7 65 F 49
# 8 0-10 M 47
# 9 11-20 M 24
# 10 21-30 M 71
# 11 31-40 M 100
# 12 41-50 M 89
# 13 51-65 M 37
# 14 65 M 20
res <- by(dat, dat$s, \(x) {
sp <- strsplit(x$a, '-')
l1 <- which(lengths(sp) == 1)
sp[[l1]] <- c(NA, gsub('\\ ', '', sp[[l1]]))
x$a2 <- cut(as.numeric(sapply(sp, `[`, 2)), c(0, 20, 40, Inf),
labels=c('0-20', '21-40', '41 '))
aggregate(v ~ a2 s, x, sum)
}) |> do.call(what=rbind)
Yields
res
# a2 s v
# F.1 0-20 F 114
# F.2 21-40 F 99
# F.3 41 F 167
# M.1 0-20 M 71
# M.2 21-40 M 171
# M.3 41 M 146
stopifnot(identical(sum(dat$v), sum(res$v)))
Data:
dat <- structure(list(a = c("0-10", "11-20", "21-30", "31-40", "41-50",
"51-65", "65 ", "0-10", "11-20", "21-30", "31-40", "41-50", "51-65",
"65 "), s = c("F", "F", "F", "F", "F", "F", "F", "M", "M", "M",
"M", "M", "M", "M"), v = c(49L, 65L, 25L, 74L, 100L, 18L, 49L,
47L, 24L, 71L, 100L, 89L, 37L, 20L)), out.attrs = list(dim = c(a = 7L,
s = 2L), dimnames = list(a = c("a=0-10", "a=11-20", "a=21-30",
"a=31-40", "a=41-50", "a=51-65", "a=65 "), s = c("s=F", "s=M"
))), row.names = c(NA, -14L), class = "data.frame")
CodePudding user response:
df %>%
rownames_to_column('rn') %>%
group_by(rn) %>%
separate_rows(Age, convert = TRUE) %>%
mutate(Age = na_if(Age, ''))%>%
fill(Age) %>%
mutate(Age = cut(Age, c(0, 17, 64, 120), c('0-17', '18-64', '65 '),
include.lowest = TRUE)) %>%
distinct() %>%
group_by(Age, Sex) %>%
summarise(Total = sum(Total), .groups = 'drop')
# A tibble: 4 x 3
Age Sex Total
<fct> <chr> <int>
1 0-17 Female 81
2 0-17 Male 58
3 65 Female 230
4 65 Male 150