I work with old census data and need to define the family occupation that should equal that of the head of the household. My solution was to combine the parish number and the house number into a unique family identifier, and then run a function that does the job. It works fine, but on the actual dataset, it is very slow. I've created a test database and a simplified function below. Is there a more efficient alternative for what I'm doing here?
test<- NULL
test$RecID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
test$Parish <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2)
test$House <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
test$Relation <- c("Head", "Wife", "Child", "Head", "Child", "Child", "Child", "Head", "Servant",
"Head", "Child", 'Child', 'Head', 'Child', 'Servant')
test$Rela <- c(1,2,3,1,3,3,3,1,4,1,3,3,1,3,4)
test$Occode <- c(1,2,5,2,1,3,4,4,4,1,1,3,2,4,1)
test <- as.data.frame(test)
# solution
test$hh <- paste(test$Parish, '.', test$House)
test$hh <- test$hh %>% str_replace_all(("\\s*"), "")
occup <- function(x){
house <- test %>% filter(hh == x)
head <- house %>% filter(Rela == 1)
occ <- head$Occode
house %>% mutate(H_Occ = occ)
}
list <- test$hh %>% unique()
bind_rows(lapply(list, occup))
CodePudding user response:
We could join the data to a version of itself that just "sees" the Parish, House, and Occupation of the Head of household. Filters and joins are fast so this should be faster than the original approach, especially on larger data.
library(dplyr)
test %>%
left_join(test %>%
filter(Relation == "Head") %>%
select(Parish, House, Occode_head = Occode)
)
Result
Joining, by = c("Parish", "House")
RecID Parish House Relation Rela Occode hh Occode_head
1 1 1 1 Head 1 1 1.1 1
2 2 1 1 Wife 2 2 1.1 1
3 3 1 1 Child 3 5 1.1 1
4 4 1 2 Head 1 2 1.2 2
5 5 1 2 Child 3 1 1.2 2
6 6 1 2 Child 3 3 1.2 2
7 7 1 3 Child 3 4 1.3 4
8 8 1 3 Head 1 4 1.3 4
9 9 1 3 Servant 4 4 1.3 4
10 10 2 1 Head 1 1 2.1 1
11 11 2 1 Child 3 1 2.1 1
12 12 2 1 Child 3 3 2.1 1
13 13 2 2 Head 1 2 2.2 2
14 14 2 2 Child 3 4 2.2 2
15 15 2 2 Servant 4 1 2.2 2
CodePudding user response:
I'm not sure how much faster this is, but:
library(dplyr)
test %>%
group_by(Parish, House) %>%
mutate(H_occ = Occode[Relation == "Head"])