I'm new to R and really struggling with what feels like a straightforward problem (that I haven't been able to find answers for).
I have a relatively big data table that essentially includes -people -where they live -what they do -move-in dates -move-out dates. My goal is to derive a running weekly census table with each week as a row, and a column for each occupation and city, populated with the headcount at that time.
#MRE
library(tidyverse)
library(lubridate)
data <- data.frame(
first_names = c("joe", "sally", "bob", "frank", "susy"),
move_in = as.Date(c("2020-01-01", "2021-01-04", "2020-04-01", "2018-12-20", "2019-10-12")),
move_out = as.Date(c("2021-01-01", NA, "2021-10-01", NA, NA)),
city = c("Denver", "Phoenix", "Austin", "Denver", "Seattle"),
occupation = c("doctor", "doctor", "architect", "teacher", "teacher"))
#what I've tried :
cities = unique(data$city)[!is.na(unique(data$city))]
occupations = unique(data$occupation)[!is.na(unique(data$occupation))]
weeks <- (date = seq(from = as.Date("2020-12-27"), to = as.Date(today()), by="1 week"))
census <- matrix(data=NA, nrows=44, ncols=12)
for (i in seq(cities)){
for (j in seq(occupations)){
count <- data %>%
filter(cities == i) %>%
filter(occupations == j) %>%
sapply(weeks, function(x)
sum(
((as.Date(data$move_in)) <= as.Date(x) &
(as.Date(data$move_out)) > as.Date(x))|
((as.Date(data$move_in)) <= as.Date(x) &
is.na(data$move_out))))
census[j,x] <- count
}}
Any help is greatly appreciated!
CodePudding user response:
Here's a possible solution using some tidyverse verbs, since you loaded that package. We loop over the weeks you're interested in using the map_dfr
function, and for each week we collect a subset of the people who are there using your logic statement above. Then, we can use group_by
to skip the double outer loop and count
them directly. Finally, we mutate
a new column for week to keep them straight after they're bound together. Outside of the loop, we then pivot_wider
to get the one-column-per-occupation and one-row-per-week format that you're looking for.
library(tidyverse)
data <- data.frame(
first_names = c("joe", "sally", "bob", "frank", "susy"),
move_in = as.Date(c("2020-01-01", "2021-01-04", "2020-04-01", "2018-12-20", "2019-10-12")),
move_out = as.Date(c("2021-01-01", NA, "2021-10-01", NA, NA)),
city = c("Denver", "Phoenix", "Austin", "Denver", "Seattle"),
occupation = c("doctor", "doctor", "architect", "teacher", "teacher"))
# Avoid needing to load lubridate by using Sys.Date() instead of today()
weeks <- (date = seq(from = as.Date("2020-12-27"), to = as.Date(Sys.Date()), by="1 week"))
map_dfr(weeks, function(week_i){
data %>%
filter(move_in<week_i & move_out > week_i | move_in < week_i & is.na(move_out)) %>%
group_by(city, occupation) %>%
count() %>%
mutate(week=week_i)
}) %>%
pivot_wider(values_from = n, names_from = occupation, values_fill = 0)
which returns
# A tibble: 170 x 5
# Groups: city [4]
city week architect doctor teacher
<chr> <date> <int> <int> <int>
1 Austin 2020-12-27 1 0 0
2 Denver 2020-12-27 0 1 1
3 Seattle 2020-12-27 0 0 1
4 Austin 2021-01-03 1 0 0
5 Denver 2021-01-03 0 0 1
6 Seattle 2021-01-03 0 0 1
7 Austin 2021-01-10 1 0 0
8 Denver 2021-01-10 0 0 1
9 Phoenix 2021-01-10 0 1 0
10 Seattle 2021-01-10 0 0 1
# ... with 160 more rows
It looks like you're getting errors due to a couple typos. You're using the filter
verb to ask for the cities
column, but data only has a city
column in the sample data set. Same for occupations
vs occupation
. Good to keep in mind for the future, but great first effort and nicely provided example!
CodePudding user response:
I used data.table. lubridate
isn't needed, I used Sys.Date().
I made census a data.table also, instead of a matrix.
data.table::CJ is pretty much the same as expand.grid.
Then used mapply instead of for loops.
Finally, re-organized from long to wide, as I think that's what you wanted.
I left in all the city_occupation combinations - not sure if that was the intent.
library(data.table)
library(magrittr)
data <- data.frame(
first_names = c("joe", "sally", "bob", "frank", "susy"),
move_in = as.Date(c("2020-01-01", "2021-01-04", "2020-04-01", "2018-12-20", "2019-10-12")),
move_out = as.Date(c("2021-01-01", NA, "2021-10-01", NA, NA)),
city = c("Denver", "Phoenix", "Austin", "Denver", "Seattle"),
occupation = c("doctor", "doctor", "architect", "teacher", "teacher"))
cities <- unique(data$city)[!is.na(unique(data$city))]
occupations <- unique(data$occupation)[!is.na(unique(data$occupation))]
weeks <- (date = seq(from = as.Date("2020-12-27"), to = Sys.Date(), by="1 week"))
data %>% setDT()
census <- CJ(week = weeks, city = cities, occupation = occupations) %>%
.[, count := mapply(function(wk, cty, occ) {
data[city == cty & occupation == occ,
sum(move_in <= wk & (move_out > wk | is.na(move_out)))]
}, week, city, occupation)]
census %<>% dcast(week ~ city occupation, value.var = 'count')
Gives:
census
week Austin_architect Austin_doctor Austin_teacher Denver_architect
1: 2020-12-27 1 0 0 0
2: 2021-01-03 1 0 0 0
3: 2021-01-10 1 0 0 0
4: 2021-01-17 1 0 0 0
5: 2021-01-24 1 0 0 0
6: 2021-01-31 1 0 0 0
7: 2021-02-07 1 0 0 0
8: 2021-02-14 1 0 0 0
9: 2021-02-21 1 0 0 0
10: 2021-02-28 1 0 0 0
11: 2021-03-07 1 0 0 0
12: 2021-03-14 1 0 0 0
13: 2021-03-21 1 0 0 0
14: 2021-03-28 1 0 0 0
15: 2021-04-04 1 0 0 0
16: 2021-04-11 1 0 0 0
17: 2021-04-18 1 0 0 0
18: 2021-04-25 1 0 0 0
19: 2021-05-02 1 0 0 0
20: 2021-05-09 1 0 0 0
21: 2021-05-16 1 0 0 0
22: 2021-05-23 1 0 0 0
23: 2021-05-30 1 0 0 0
24: 2021-06-06 1 0 0 0
25: 2021-06-13 1 0 0 0
26: 2021-06-20 1 0 0 0
27: 2021-06-27 1 0 0 0
28: 2021-07-04 1 0 0 0
29: 2021-07-11 1 0 0 0
30: 2021-07-18 1 0 0 0
31: 2021-07-25 1 0 0 0
32: 2021-08-01 1 0 0 0
33: 2021-08-08 1 0 0 0
34: 2021-08-15 1 0 0 0
35: 2021-08-22 1 0 0 0
36: 2021-08-29 1 0 0 0
37: 2021-09-05 1 0 0 0
38: 2021-09-12 1 0 0 0
39: 2021-09-19 1 0 0 0
40: 2021-09-26 1 0 0 0
41: 2021-10-03 0 0 0 0
42: 2021-10-10 0 0 0 0
43: 2021-10-17 0 0 0 0
44: 2021-10-24 0 0 0 0
week Austin_architect Austin_doctor Austin_teacher Denver_architect
Denver_doctor Denver_teacher Phoenix_architect Phoenix_doctor
1: 1 1 0 0
2: 0 1 0 0
3: 0 1 0 1
4: 0 1 0 1
5: 0 1 0 1
6: 0 1 0 1
7: 0 1 0 1
8: 0 1 0 1
9: 0 1 0 1
10: 0 1 0 1
11: 0 1 0 1
12: 0 1 0 1
13: 0 1 0 1
14: 0 1 0 1
15: 0 1 0 1
16: 0 1 0 1
17: 0 1 0 1
18: 0 1 0 1
19: 0 1 0 1
20: 0 1 0 1
21: 0 1 0 1
22: 0 1 0 1
23: 0 1 0 1
24: 0 1 0 1
25: 0 1 0 1
26: 0 1 0 1
27: 0 1 0 1
28: 0 1 0 1
29: 0 1 0 1
30: 0 1 0 1
31: 0 1 0 1
32: 0 1 0 1
33: 0 1 0 1
34: 0 1 0 1
35: 0 1 0 1
36: 0 1 0 1
37: 0 1 0 1
38: 0 1 0 1
39: 0 1 0 1
40: 0 1 0 1
41: 0 1 0 1
42: 0 1 0 1
43: 0 1 0 1
44: 0 1 0 1
Denver_doctor Denver_teacher Phoenix_architect Phoenix_doctor
Phoenix_teacher Seattle_architect Seattle_doctor Seattle_teacher
1: 0 0 0 1
2: 0 0 0 1
3: 0 0 0 1
4: 0 0 0 1
5: 0 0 0 1
6: 0 0 0 1
7: 0 0 0 1
8: 0 0 0 1
9: 0 0 0 1
10: 0 0 0 1
11: 0 0 0 1
12: 0 0 0 1
13: 0 0 0 1
14: 0 0 0 1
15: 0 0 0 1
16: 0 0 0 1
17: 0 0 0 1
18: 0 0 0 1
19: 0 0 0 1
20: 0 0 0 1
21: 0 0 0 1
22: 0 0 0 1
23: 0 0 0 1
24: 0 0 0 1
25: 0 0 0 1
26: 0 0 0 1
27: 0 0 0 1
28: 0 0 0 1
29: 0 0 0 1
30: 0 0 0 1
31: 0 0 0 1
32: 0 0 0 1
33: 0 0 0 1
34: 0 0 0 1
35: 0 0 0 1
36: 0 0 0 1
37: 0 0 0 1
38: 0 0 0 1
39: 0 0 0 1
40: 0 0 0 1
41: 0 0 0 1
42: 0 0 0 1
43: 0 0 0 1
44: 0 0 0 1
Phoenix_teacher Seattle_architect Seattle_doctor Seattle_teacher