Home > Blockchain >  How do I compress/group rows when there is no criteria in R?
How do I compress/group rows when there is no criteria in R?

Time:04-20

I have a data set that looks something like this

Age Sex Total
0-4 Female 2
0-4 Male 23
5-9 Female 43
5-9 Male 20
10-14 Female 36
10-14 Male 15
... ... ...
... ... ...
85 Female 230
85 Male 150

I want to group the rows so that the age bands consist of 0-17; 18-64; 65 . So the data frame would look something like...

Age Sex Total
0-17 Female 23
0-17 Male 27
18-64 Female 549
18-64 Male 437
65 Female 871
65 Male 789

Because there is no obvious criteria for grouping I cannot use the regular functions for grouping rows such as the group_by function.

Any suggestions welcome - Thanks!

CodePudding user response:

You may use strsplit on "-" in a by and subset to second element with bracket function in sapply. The '65 ' elements getr special care with gsub and ad NA. Then cut at desired breaks, finally rbind. Here a small example:

dat
#        a s   v
# 1   0-10 F  49
# 2  11-20 F  65
# 3  21-30 F  25
# 4  31-40 F  74
# 5  41-50 F 100
# 6  51-65 F  18
# 7    65  F  49
# 8   0-10 M  47
# 9  11-20 M  24
# 10 21-30 M  71
# 11 31-40 M 100
# 12 41-50 M  89
# 13 51-65 M  37
# 14   65  M  20

res <- by(dat, dat$s, \(x) {
  sp <- strsplit(x$a, '-')
  l1 <- which(lengths(sp) == 1)
  sp[[l1]] <- c(NA, gsub('\\ ', '', sp[[l1]]))
  x$a2 <- cut(as.numeric(sapply(sp, `[`, 2)), c(0, 20, 40, Inf),
              labels=c('0-20', '21-40', '41 '))
  aggregate(v ~ a2   s, x, sum)
}) |> do.call(what=rbind)

Yields

res
#        a2 s   v
# F.1  0-20 F 114
# F.2 21-40 F  99
# F.3   41  F 167
# M.1  0-20 M  71
# M.2 21-40 M 171
# M.3   41  M 146

stopifnot(identical(sum(dat$v), sum(res$v)))

Data:

dat <- structure(list(a = c("0-10", "11-20", "21-30", "31-40", "41-50", 
"51-65", "65 ", "0-10", "11-20", "21-30", "31-40", "41-50", "51-65", 
"65 "), s = c("F", "F", "F", "F", "F", "F", "F", "M", "M", "M", 
"M", "M", "M", "M"), v = c(49L, 65L, 25L, 74L, 100L, 18L, 49L, 
47L, 24L, 71L, 100L, 89L, 37L, 20L)), out.attrs = list(dim = c(a = 7L, 
s = 2L), dimnames = list(a = c("a=0-10", "a=11-20", "a=21-30", 
"a=31-40", "a=41-50", "a=51-65", "a=65 "), s = c("s=F", "s=M"
))), row.names = c(NA, -14L), class = "data.frame")

CodePudding user response:

df %>%
  rownames_to_column('rn') %>%
  group_by(rn) %>%
  separate_rows(Age, convert = TRUE) %>%
  mutate(Age = na_if(Age, ''))%>%
  fill(Age) %>%
  mutate(Age = cut(Age, c(0, 17, 64, 120), c('0-17', '18-64', '65 '), 
             include.lowest = TRUE)) %>%
  distinct() %>%
  group_by(Age, Sex) %>%
  summarise(Total = sum(Total), .groups = 'drop')

# A tibble: 4 x 3
  Age   Sex    Total
  <fct> <chr>  <int>
1 0-17  Female    81
2 0-17  Male      58
3 65    Female   230
4 65    Male     150
  • Related