Home > OS >  Calculating rowsums of grouped data
Calculating rowsums of grouped data

Time:08-24

I am using the datase who (available in the library datasets tidyr), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 entries are to do with year, country name and such.

I want to calculate the sum of new cases per country per year, but I just can't make it work.

I was ecpecting something like

group_by(who, country) %>% summarise(count = rowsum(.[5:60]))

would work, but it doesn't.

Can anyone help me understand why it doesn't work, and what to do instead?

CodePudding user response:

You're missing a first step, which is to gather the data into a 'tidy' format. Try this:

who%>%
  gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
  filter(!is.na(cases))%>%
  group_by(country,year)%>%
  summarise(sum(cases))

Which gives output:

# A tibble: 3,484 × 3
# Groups:   country [219]
   country      year `sum(cases)`
   <chr>       <int>        <int>
 1 Afghanistan  1997          128
 2 Afghanistan  1998         1778
 3 Afghanistan  1999          745
 4 Afghanistan  2000         2666
 5 Afghanistan  2001         4639

CodePudding user response:

library(tidyverse)

(long_who <- who |> pivot_longer(cols = -(1:4)))

long_who |> filter(startsWith(name,"new")) |> # dont want things like "Population"
  group_by(country) |> 
  summarise(sum_of_new_ = sum(value,na.rm=TRUE))

CodePudding user response:

A base r approach

data.frame(who[,c("country", "year")],
           cnt = rowSums(who[5:60], na.rm = TRUE))

#>         country year cnt
#> 1 Afghanistan 1980   0
#> 2 Afghanistan 1981   0
#> 3 Afghanistan 1982   0
#> 4 Afghanistan 1983   0
#> 5 Afghanistan 1984   0
#> 6 Afghanistan 1985   0

CodePudding user response:

You could also do without the long format by using rowSums and across:

library(dplyr)

who |> 
  group_by(country, year) |>
  summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
  ungroup()

Alternatives to across(5:58):

  • across(starts_with("new"))
  • across(-(1:4))

Output:

# A tibble: 20 × 3
# Groups:   country [1]
   country      year count
   <chr>       <int> <dbl>
 1 Afghanistan  1980     0
 2 Afghanistan  1981     0
 3 Afghanistan  1982     0
 4 Afghanistan  1983     0
 5 Afghanistan  1984     0
 6 Afghanistan  1985     0
 7 Afghanistan  1986     0
 8 Afghanistan  1987     0
 9 Afghanistan  1988     0
10 Afghanistan  1989     0
11 Afghanistan  1990     0
12 Afghanistan  1991     0
13 Afghanistan  1992     0
14 Afghanistan  1993     0
15 Afghanistan  1994     0
16 Afghanistan  1995     0
17 Afghanistan  1996     0
18 Afghanistan  1997   128
19 Afghanistan  1998  1778
20 Afghanistan  1999   745
  • Related