I am a newly self-taught user of R and require assistance.
I am working with a dataset that has captured location of residence and whether the locality is metropolitan, regional or rural over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (ie. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved and where they moved to.
I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table that has number of individuals who havent moved ( - where they are located) and number of individuals that have moved ( - where they have moved to).
Your assistance would be greatly appreciated.
Dummy dataset:
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
)
Cheers in advance.
CodePudding user response:
You can use the lead function to add columns containing the persons location in the following year. Using mutate across, you can apply the lead
to three columns simultaneously. You can then make a row-wise comparisons and look for moves.
#Group by individual before applying the lead function
#Apply the lead function to the three listed columns and add "nextyear" as a suffix
#Add a logical column which returns TRUE if any change of residence or locality is detected.
stack%>%
group_by(ID)%>%
mutate(across(c("Year", "Residence", "Locality"), list(nextyear=lead)))%>%
mutate(Residence.change=Residence!=Residence_nextyear,
Locality.change=Locality!=Locality_nextyear)
# A tibble: 21 x 9
# Groups: ID [3]
ID Year Residence Locality Year_nextyear Residence_nextyear Locality_nextyear Residence.change Locality.change
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl>
1 a 2015 Sydney Metro 2016 Sydney Metro FALSE FALSE
2 a 2016 Sydney Metro 2017 Sydney Metro FALSE FALSE
3 a 2017 Sydney Metro 2018 Sydney Metro FALSE FALSE
4 a 2018 Sydney Metro 2019 Sydney Metro FALSE FALSE
5 a 2019 Sydney Metro 2020 Sydney Metro FALSE FALSE
6 a 2020 Sydney Metro 2021 Sydney Metro FALSE FALSE
7 a 2021 Sydney Metro NA NA NA NA NA
8 b 2015 Sydney Metro 2016 Orange Regional TRUE TRUE
9 b 2016 Orange Regional 2017 Orange Regional FALSE FALSE
10 b 2017 Orange Regional 2018 Orange Regional FALSE FALSE
CodePudding user response:
Similar to what @Dealec did, I used the lag function from dplyr instead.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
) %>%
clean_names()
results <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ location
)
) %>%
ungroup() %>%
select(-row)
results
#> # A tibble: 21 x 6
#> id year residence locality location movement
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2015 Sydney Metro Sydney_Metro <NA>
#> 2 a 2016 Sydney Metro Sydney_Metro no_movement
#> 3 a 2017 Sydney Metro Sydney_Metro no_movement
#> 4 a 2018 Sydney Metro Sydney_Metro no_movement
#> 5 a 2019 Sydney Metro Sydney_Metro no_movement
#> 6 a 2020 Sydney Metro Sydney_Metro no_movement
#> 7 a 2021 Sydney Metro Sydney_Metro no_movement
#> 8 b 2015 Sydney Metro Sydney_Metro <NA>
#> 9 b 2016 Orange Regional Orange_Regional Orange_Regional
#> 10 b 2017 Orange Regional Orange_Regional no_movement
#> # ... with 11 more rows
results %>%
count(year, movement) %>%
pivot_wider(names_from = movement,
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year na no_movement orange_regional broken_hill_rural sydney_metro
#> <chr> <int> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA NA
#> 2 2016 NA 2 1 NA NA
#> 3 2017 NA 3 NA NA NA
#> 4 2018 NA 3 NA NA NA
#> 5 2019 NA 3 NA NA NA
#> 6 2020 NA 2 NA 1 NA
#> 7 2021 NA 2 NA NA 1
Created on 2022-04-22 by the reprex package (v2.0.1)