Home > Software engineering >  Comparing the sums of categorical and individual data in the same column in R
Comparing the sums of categorical and individual data in the same column in R

Time:03-04

I have this dataset of almost half a million rows. To summarize, it looks like this

ages   totals year sex subcounty county
0-4   500     2015 M   subcounty1 county1
0     100     2015 M   subcounty1 county1
1     150     2015 M   subcounty1 county1
2     50      2015 M   subcounty1 county1
3     100     2015 M   subcounty1 county1
4     100     2015 M   subcounty1 county1
5-9   1000    2015 M   subcounty1 county1
5     300     2015 M   subcounty1 county1
6     200     2015 M   subcounty1 county1
7     200     2015 M   subcounty1 county1
8     200     2015 M   subcounty1 county1
9     50      2015 M   subcounty1 county1
0-4   2000    2017 F   subcounty1 county1
0     700     2017 F   subcounty1 county1
1     250     2017 F   subcounty1 county1
2     300     2017 F   subcounty1 county1
3     250     2017 F   subcounty1 county1
4     500     2017 F   subcounty1 county1

the individual ages and age groups repeat so on for year,sex, sub county and county. The source of the dataset is not reliable so I want to confirm that the sums of the totals actually correspond to each other. Basically, I want to take the totals sum of 0,1,2,3,4 and compare it to that corresponding total for 0-4. I want to repeat this for every single age group for each year,sex,subcounty and county (it's a large dataset). I usually use dplyr but not sure if I should use another package for this.

CodePudding user response:

This should filter out any groups with issues. It assumes the first row of each age group has the range in ages, and looks for a "-" character in ages to tell when a new group starts---you may need to adjust that logic if your data doesn't always match that pattern (e.g., if you have an "85 " age group you may need to change the regex pattern to "[- ]" and set fixed = FALSE).

df %>%
  mutate(age_group = ifelse(grepl("-", ages, fixed = TRUE), ages, NA)) %>%
  tidyr::fill(age_group, .direction = "down") %>%
  group_by(age_group, year, sex, subcounty, county) %>%
  mutate(
    check_total = ifelse(grepl("-", ages, fixed = TRUE), sum(totals) - totals, totals)
  ) %>%
  filter(any(check_total != totals))
# # A tibble: 6 × 8
# # Groups:   age_group, year, sex, subcounty, county [1]
#   ages  totals  year sex   subcounty  county  age_group check_total
#   <chr>  <int> <int> <chr> <chr>      <chr>   <chr>           <int>
# 1 5-9     1000  2015 M     subcounty1 county1 5-9               950
# 2 5        300  2015 M     subcounty1 county1 5-9               300
# 3 6        200  2015 M     subcounty1 county1 5-9               200
# 4 7        200  2015 M     subcounty1 county1 5-9               200
# 5 8        200  2015 M     subcounty1 county1 5-9               200
# 6 9         50  2015 M     subcounty1 county1 5-9                50
  •  Tags:  
  • r
  • Related