Summarising movements of individuals spread over several rows-CodePudding

I am a newly self-taught user of R and require assistance.

I am working with a dataset that has captured location of residence and whether the locality is metropolitan, regional or rural over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (ie. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved and where they moved to.

I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table that has number of individuals who havent moved ( - where they are located) and number of individuals that have moved ( - where they have moved to).

Your assistance would be greatly appreciated.

Dummy dataset:

stack <-  tribble(
    ~ID, ~Year, ~Residence, ~Locality,
    #--/--/--/----
    "a", "2015", "Sydney", "Metro",
    "a", "2016", "Sydney", "Metro",
    "a", "2017", "Sydney", "Metro",
    "a", "2018", "Sydney", "Metro",
    "a", "2019", "Sydney", "Metro",
    "a", "2020", "Sydney", "Metro",
    "a", "2021", "Sydney", "Metro",
    "b", "2015", "Sydney", "Metro",
    "b", "2016", "Orange", "Regional",
    "b", "2017", "Orange", "Regional",
    "b", "2018", "Orange", "Regional",
    "b", "2019", "Orange", "Regional",
    "b", "2020", "Broken Hill", "Rural",
    "b", "2021", "Sydney", "Metro",
    "c", "2015", "Dubbo", "Regional",
    "c", "2016", "Dubbo", "Regional",
    "c", "2017", "Dubbo", "Regional",
    "c", "2018", "Dubbo", "Regional",
    "c", "2019", "Dubbo", "Regional",
    "c", "2020", "Dubbo", "Regional",
    "c", "2021", "Dubbo", "Regional",
  )

Cheers in advance.

CodePudding user response：

You can use the lead function to add columns containing the persons location in the following year. Using mutate across, you can apply the lead to three columns simultaneously. You can then make a row-wise comparisons and look for moves.

#Group by individual before applying the lead function
#Apply the lead function to the three listed columns and add "nextyear" as a suffix
#Add a logical column which returns TRUE if any change of residence or locality is detected.

stack%>%
group_by(ID)%>%
mutate(across(c("Year", "Residence", "Locality"), list(nextyear=lead)))%>%
mutate(Residence.change=Residence!=Residence_nextyear,
Locality.change=Locality!=Locality_nextyear)

# A tibble: 21 x 9
# Groups:   ID [3]
   ID    Year  Residence Locality Year_nextyear Residence_nextyear Locality_nextyear Residence.change Locality.change
   <chr> <chr> <chr>     <chr>    <chr>         <chr>              <chr>             <lgl>            <lgl>          
 1 a     2015  Sydney    Metro    2016          Sydney             Metro             FALSE            FALSE          
 2 a     2016  Sydney    Metro    2017          Sydney             Metro             FALSE            FALSE          
 3 a     2017  Sydney    Metro    2018          Sydney             Metro             FALSE            FALSE          
 4 a     2018  Sydney    Metro    2019          Sydney             Metro             FALSE            FALSE          
 5 a     2019  Sydney    Metro    2020          Sydney             Metro             FALSE            FALSE          
 6 a     2020  Sydney    Metro    2021          Sydney             Metro             FALSE            FALSE          
 7 a     2021  Sydney    Metro    NA            NA                 NA                NA               NA             
 8 b     2015  Sydney    Metro    2016          Orange             Regional          TRUE             TRUE           
 9 b     2016  Orange    Regional 2017          Orange             Regional          FALSE            FALSE          
10 b     2017  Orange    Regional 2018          Orange             Regional          FALSE            FALSE

CodePudding user response：

Similar to what @Dealec did, I used the lag function from dplyr instead.

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

stack <-  tribble(
  ~ID, ~Year, ~Residence, ~Locality,
  #--/--/--/----
  "a", "2015", "Sydney", "Metro",
  "a", "2016", "Sydney", "Metro",
  "a", "2017", "Sydney", "Metro",
  "a", "2018", "Sydney", "Metro",
  "a", "2019", "Sydney", "Metro",
  "a", "2020", "Sydney", "Metro",
  "a", "2021", "Sydney", "Metro",
  "b", "2015", "Sydney", "Metro",
  "b", "2016", "Orange", "Regional",
  "b", "2017", "Orange", "Regional",
  "b", "2018", "Orange", "Regional",
  "b", "2019", "Orange", "Regional",
  "b", "2020", "Broken Hill", "Rural",
  "b", "2021", "Sydney", "Metro",
  "c", "2015", "Dubbo", "Regional",
  "c", "2016", "Dubbo", "Regional",
  "c", "2017", "Dubbo", "Regional",
  "c", "2018", "Dubbo", "Regional",
  "c", "2019", "Dubbo", "Regional",
  "c", "2020", "Dubbo", "Regional",
  "c", "2021", "Dubbo", "Regional",
) %>% 
  clean_names()

results <- stack %>%
  mutate(location = paste(residence, locality, sep = "_")) %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(
    row = row_number(),
    movement = case_when(
      row == 1 ~ NA_character_,
      location == lag(location, n = 1) ~ "no_movement",
      TRUE ~ location
    )
  ) %>% 
  ungroup() %>% 
  select(-row)

results
#> # A tibble: 21 x 6
#>    id    year  residence locality location        movement       
#>    <chr> <chr> <chr>     <chr>    <chr>           <chr>          
#>  1 a     2015  Sydney    Metro    Sydney_Metro    <NA>           
#>  2 a     2016  Sydney    Metro    Sydney_Metro    no_movement    
#>  3 a     2017  Sydney    Metro    Sydney_Metro    no_movement    
#>  4 a     2018  Sydney    Metro    Sydney_Metro    no_movement    
#>  5 a     2019  Sydney    Metro    Sydney_Metro    no_movement    
#>  6 a     2020  Sydney    Metro    Sydney_Metro    no_movement    
#>  7 a     2021  Sydney    Metro    Sydney_Metro    no_movement    
#>  8 b     2015  Sydney    Metro    Sydney_Metro    <NA>           
#>  9 b     2016  Orange    Regional Orange_Regional Orange_Regional
#> 10 b     2017  Orange    Regional Orange_Regional no_movement    
#> # ... with 11 more rows

results %>% 
  count(year, movement) %>% 
  pivot_wider(names_from = movement,
              values_from = n) %>% 
  clean_names()
#> # A tibble: 7 x 6
#>   year     na no_movement orange_regional broken_hill_rural sydney_metro
#>   <chr> <int>       <int>           <int>             <int>        <int>
#> 1 2015      3          NA              NA                NA           NA
#> 2 2016     NA           2               1                NA           NA
#> 3 2017     NA           3              NA                NA           NA
#> 4 2018     NA           3              NA                NA           NA
#> 5 2019     NA           3              NA                NA           NA
#> 6 2020     NA           2              NA                 1           NA
#> 7 2021     NA           2              NA                NA            1

^{Created on 2022-04-22 by the reprex package (v2.0.1)}