Calculating rowsums of grouped data-CodePudding

I am using the datase who (available in the library ~~datasets~~ tidyr), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 entries are to do with year, country name and such.

I want to calculate the sum of new cases per country per year, but I just can't make it work.

I was ecpecting something like

group_by(who, country) %>% summarise(count = rowsum(.[5:60]))

would work, but it doesn't.

Can anyone help me understand why it doesn't work, and what to do instead?

CodePudding user response：

You're missing a first step, which is to gather the data into a 'tidy' format. Try this:

who%>%
  gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
  filter(!is.na(cases))%>%
  group_by(country,year)%>%
  summarise(sum(cases))

Which gives output:

# A tibble: 3,484 × 3
# Groups:   country [219]
   country      year `sum(cases)`
   <chr>       <int>        <int>
 1 Afghanistan  1997          128
 2 Afghanistan  1998         1778
 3 Afghanistan  1999          745
 4 Afghanistan  2000         2666
 5 Afghanistan  2001         4639

CodePudding user response：

library(tidyverse)

(long_who <- who |> pivot_longer(cols = -(1:4)))

long_who |> filter(startsWith(name,"new")) |> # dont want things like "Population"
  group_by(country) |> 
  summarise(sum_of_new_ = sum(value,na.rm=TRUE))

CodePudding user response：

A base r approach

data.frame(who[,c("country", "year")],
           cnt = rowSums(who[5:60], na.rm = TRUE))

#>         country year cnt
#> 1 Afghanistan 1980   0
#> 2 Afghanistan 1981   0
#> 3 Afghanistan 1982   0
#> 4 Afghanistan 1983   0
#> 5 Afghanistan 1984   0
#> 6 Afghanistan 1985   0

CodePudding user response：

You could also do without the long format by using rowSums and across:

library(dplyr)

who |> 
  group_by(country, year) |>
  summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
  ungroup()

Alternatives to across(5:58):

across(starts_with("new"))
across(-(1:4))

Output:

# A tibble: 20 × 3
# Groups:   country [1]
   country      year count
   <chr>       <int> <dbl>
 1 Afghanistan  1980     0
 2 Afghanistan  1981     0
 3 Afghanistan  1982     0
 4 Afghanistan  1983     0
 5 Afghanistan  1984     0
 6 Afghanistan  1985     0
 7 Afghanistan  1986     0
 8 Afghanistan  1987     0
 9 Afghanistan  1988     0
10 Afghanistan  1989     0
11 Afghanistan  1990     0
12 Afghanistan  1991     0
13 Afghanistan  1992     0
14 Afghanistan  1993     0
15 Afghanistan  1994     0
16 Afghanistan  1995     0
17 Afghanistan  1996     0
18 Afghanistan  1997   128
19 Afghanistan  1998  1778
20 Afghanistan  1999   745