Based on the data below how can I sum
the columns Inflow
, Outflow
, NetMigration
, InAGI
and OutAGI
by FIPS
across two time periods 2011-2015
and 2016-2020
? Some counties might not have the data for a particular fiscal year but, that does not matter since the idea is to sum the data within these two time period ranges. Naturally some NAs
are to be expected in the final dataset. I am using FIPS
because there are several counties with the same name. So, the column Key
is no longer needed as it's a concatenate of FIPS
and Year
.
Desired output schema/columns:
FIPS County State TotInflow TotOutflow TotNetMigration TotInAGI TotOutAGI Time_Period
12001 Alachua County FL 2011-2015
12001 Alachua County FL 2016-2020
08001 Adams County CO 2011-2015
08001 Adams County CO 2016-2020
Sample data:
df = structure(list(Key = c("080012020", "120012020", "120012018",
"120012017", "080012017", "120012016", "120012015", "080012014",
"120012013", "120012012", "080012012", "080012011", "080012016"
), County = c("Adams County", "Alachua County", "Alachua County",
"Alachua County", "Adams County", "Alachua County", "Alachua County",
"Adams County", "Alachua County", "Alachua County", "Adams County",
"Adams County", "Adams County"), State = c("CO", "FL", "FL",
"FL", "CO", "FL", "FL", "CO", "FL", "FL", "CO", "CO", "CO"),
FIPS = c("08001", "12001", "12001", "12001", "08001", "12001",
"12001", "08001", "12001", "12001", "08001", "08001", "08001"
), Inflow = c(38L, 261L, 321L, 339L, 58L, 288L, 254L, 46L,
413L, 433L, 30L, 42L, NA), InAGI = c(1817L, 6287L, 8423L,
8364L, 1865L, 14720L, 5224L, 1074L, 11774L, 10151L, 921L,
500L, NA), FiscalYear = c("2019- 2020", "2019- 2020", "2017 - 2018",
"2016 - 2017", "2016 - 2017", "2015 - 2016", "2014 - 2015",
"2013 - 2014", "2012 - 2013", "2011 - 2012", "2011 - 2012",
"2010 - 2011", "2015 - 2016"), Year = c(2020L, 2020L, 2018L,
2017L, 2017L, 2016L, 2015L, 2014L, 2013L, 2012L, 2012L, 2011L,
2016L), Outflow = c(54L, 447L, 444L, 558L, 44L, 436L, 334L,
49L, 466L, 495L, 39L, 31L, 51L), OutAGI = c(1879L, 13106L,
15409L, 16496L, 2408L, 12675L, 7448L, 733L, 10309L, 11677L,
847L, 605L, 1114L), NetMigration = c(-16L, -186L, -123L,
-219L, 14L, -148L, -80L, -3L, -53L, -62L, -9L, 11L, NA)), row.names = c(NA,
-13L), class = "data.frame")
CodePudding user response:
Do the grouping by 'FIPS', 'County', 'State' and a Time_Period
column created from 'Year' based on whether the 'Year' lies between some start, end year, and then get the sum
of the columns interested by looping across
those column names
library(dplyr)
df %>%
group_by(FIPS, County, State,
Time_Period = case_when(between(Year, 2011, 2015)~
'2011-2015', between(Year, 2016, 2020)~ '2016-2020')) %>%
summarise(across(c(Inflow, InAGI, Outflow, OutAGI, NetMigration),
~ sum(.x, na.rm = TRUE), .names = "Total{.col}"),.groups = "drop")
-output
# A tibble: 4 × 9
FIPS County State Time_Period TotalInflow TotalInAGI TotalOutflow TotalOutAGI TotalNetMigration
<chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
1 08001 Adams County CO 2011-2015 118 2495 119 2185 -1
2 08001 Adams County CO 2016-2020 96 3682 149 5401 -2
3 12001 Alachua County FL 2011-2015 1100 27149 1295 29434 -195
4 12001 Alachua County FL 2016-2020 1209 37794 1885 57686 -676