I have this dataset of almost half a million rows. To summarize, it looks like this
ages totals year sex subcounty county
0-4 500 2015 M subcounty1 county1
0 100 2015 M subcounty1 county1
1 150 2015 M subcounty1 county1
2 50 2015 M subcounty1 county1
3 100 2015 M subcounty1 county1
4 100 2015 M subcounty1 county1
5-9 1000 2015 M subcounty1 county1
5 300 2015 M subcounty1 county1
6 200 2015 M subcounty1 county1
7 200 2015 M subcounty1 county1
8 200 2015 M subcounty1 county1
9 50 2015 M subcounty1 county1
0-4 2000 2017 F subcounty1 county1
0 700 2017 F subcounty1 county1
1 250 2017 F subcounty1 county1
2 300 2017 F subcounty1 county1
3 250 2017 F subcounty1 county1
4 500 2017 F subcounty1 county1
the individual ages and age groups repeat so on for year,sex, sub county and county. The source of the dataset is not reliable so I want to confirm that the sums of the totals actually correspond to each other. Basically, I want to take the totals sum of 0,1,2,3,4 and compare it to that corresponding total for 0-4. I want to repeat this for every single age group for each year,sex,subcounty and county (it's a large dataset). I usually use dplyr but not sure if I should use another package for this.
CodePudding user response:
This should filter out any groups with issues. It assumes the first row of each age group has the range in ages
, and looks for a "-"
character in ages
to tell when a new group starts---you may need to adjust that logic if your data doesn't always match that pattern (e.g., if you have an "85 " age group you may need to change the regex pattern to "[- ]"
and set fixed = FALSE
).
df %>%
mutate(age_group = ifelse(grepl("-", ages, fixed = TRUE), ages, NA)) %>%
tidyr::fill(age_group, .direction = "down") %>%
group_by(age_group, year, sex, subcounty, county) %>%
mutate(
check_total = ifelse(grepl("-", ages, fixed = TRUE), sum(totals) - totals, totals)
) %>%
filter(any(check_total != totals))
# # A tibble: 6 × 8
# # Groups: age_group, year, sex, subcounty, county [1]
# ages totals year sex subcounty county age_group check_total
# <chr> <int> <int> <chr> <chr> <chr> <chr> <int>
# 1 5-9 1000 2015 M subcounty1 county1 5-9 950
# 2 5 300 2015 M subcounty1 county1 5-9 300
# 3 6 200 2015 M subcounty1 county1 5-9 200
# 4 7 200 2015 M subcounty1 county1 5-9 200
# 5 8 200 2015 M subcounty1 county1 5-9 200
# 6 9 50 2015 M subcounty1 county1 5-9 50