Home > Back-end >  Fetching values from one column based on other column keys in long-formatted dataset
Fetching values from one column based on other column keys in long-formatted dataset

Time:05-07

I have a long format dataset of 100,000 individuals, capturing clinic visits at 5 different time points (not chronological). I've included an example dataset below that replicates the formatting of my data:

  • ID: participant ID visit_number: order of the clinic visits in the original dataset

  • age_visit: age at the time of the visit

  • clinic_number: the identifier for the specific clinic location

  • age_sorted: For each ID, age sorted in ascending order across the 5 clinic visits

  • age_sorted_index: For each ID, the visit number corresponding to the sorted age

I would like to create a new column (clinic_number_extracted) that fetches the clinic identifier (clinic_number) corresponding to each sorted age (age_sorted value) for each participant. I was thinking that it might be possible to use the age_sorted_index and visit_number variables to do so (generating key-value pairs?), but am not quite sure how to do this outside of data.table. A tidyverse solution would be preferred.

I've looked on R community and stack exchange for clues, but haven't been able to find exactly what I'm looking for (likely not using the correct search terms). I tried to play around with group_by(across()) and with_order(order_by()) functions without much success. I can potentially create a new variable with a few case_when() conditions but might run into issues if there are repeated age_assessment_sorted values.

set.seed(42)

# Beginning dataset
das <- data.frame(id = rep(letters[1:3], each = 5), 
                  visit_number = rep(1:5, times = 3),
                  age_visit = c(50, rep(NA_real_, times = 7), 34, 40, 72, rep(NA_real_, times = 3), 87), 
                  clinic_number = sample(30:50, 15, replace=TRUE), 
                  age_sorted = c(50, rep(NA_real_, times = 4), 34, 40,rep(NA_real_, times = 3), 72, 87, rep(NA_real_, times = 3)), 
                  age_sorted_index = c(rep(1:5), 4, 5, rep(1:3), 1, 5, 2, 3, 4)) 

# Print out dataset
das
#>    id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1   a            1        50            46         50                1
#> 2   a            2        NA            34         NA                2
#> 3   a            3        NA            30         NA                3
#> 4   a            4        NA            39         NA                4
#> 5   a            5        NA            33         NA                5
#> 6   b            1        NA            47         34                4
#> 7   b            2        NA            46         40                5
#> 8   b            3        NA            44         NA                1
#> 9   b            4        34            36         NA                2
#> 10  b            5        40            33         NA                3
#> 11  c            1        72            34         72                1
#> 12  c            2        NA            43         87                5
#> 13  c            3        NA            49         NA                2
#> 14  c            4        NA            47         NA                3
#> 15  c            5        87            44         NA                4

Desired data:

das_final <- cbind(das, 
                  clinic_number_extracted = c(46, rep(NA_real_, times = 4), 36, 33, rep(NA_real_, times = 3), 34, 44, rep(NA_real_, times = 3)))

# Print out final dataset
das_final 
#>    id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1   a            1        50            46         50                1
#> 2   a            2        NA            34         NA                2
#> 3   a            3        NA            30         NA                3
#> 4   a            4        NA            39         NA                4
#> 5   a            5        NA            33         NA                5
#> 6   b            1        NA            47         34                4
#> 7   b            2        NA            46         40                5
#> 8   b            3        NA            44         NA                1
#> 9   b            4        34            36         NA                2
#> 10  b            5        40            33         NA                3
#> 11  c            1        72            34         72                1
#> 12  c            2        NA            43         87                5
#> 13  c            3        NA            49         NA                2
#> 14  c            4        NA            47         NA                3
#> 15  c            5        87            44         NA                4
#>    clinic_number_extracted
#> 1                       46
#> 2                       NA
#> 3                       NA
#> 4                       NA
#> 5                       NA
#> 6                       36
#> 7                       33
#> 8                       NA
#> 9                       NA
#> 10                      NA
#> 11                      34
#> 12                      44
#> 13                      NA
#> 14                      NA
#> 15                      NA

Created on 2022-05-06 by the reprex package (v2.0.1)

CodePudding user response:

We may use match between the 'age_sorted' and 'age_visit', use that index to subset the 'clinic_number' and replace the elements where they are NA in 'age_sorted' to NA

library(dplyr)
das_final %>% 
  group_by(id) %>%
  mutate(clinic_new = clinic_number[match(age_sorted, age_visit)] * 
      NA^is.na(age_sorted)) %>%
  ungroup

-output

# A tibble: 15 × 8
   id    visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted clinic_new
   <chr>        <int>     <dbl>         <int>      <dbl>            <dbl>                   <dbl>      <dbl>
 1 a                1        50            46         50                1                      46         46
 2 a                2        NA            34         NA                2                      NA         NA
 3 a                3        NA            30         NA                3                      NA         NA
 4 a                4        NA            39         NA                4                      NA         NA
 5 a                5        NA            33         NA                5                      NA         NA
 6 b                1        NA            47         34                4                      36         36
 7 b                2        NA            46         40                5                      33         33
 8 b                3        NA            44         NA                1                      NA         NA
 9 b                4        34            36         NA                2                      NA         NA
10 b                5        40            33         NA                3                      NA         NA
11 c                1        72            34         72                1                      34         34
12 c                2        NA            43         87                5                      44         44
13 c                3        NA            49         NA                2                      NA         NA
14 c                4        NA            47         NA                3                      NA         NA
15 c                5        87            44         NA                4                      NA         NA

CodePudding user response:

Could you just do this without needing to group? It seems to solve the issue in the sample data, but perhaps I am missing something:

das$clinic_number_extracted <- ifelse(!is.na(das$age_sorted), das$clinic_number, NA)

Output:

#   id visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted
1   a            1        50            46         50                1                      46
#2   a            2        NA            34         NA                2                      NA
#3   a            3        NA            30         NA                3                      NA
#4   a            4        NA            39         NA                4                      NA
#5   a            5        NA            33         NA                5                      NA
#6   b            1        NA            47         34                4                      47
#7   b            2        NA            46         40                5                      46
#8   b            3        NA            44         NA                1                      NA
#9   b            4        34            36         NA                2                      NA
#10  b            5        40            33         NA                3                      NA
#11  c            1        72            34         72                1                      34
#12  c            2        NA            43         87                5                      43
#13  c            3        NA            49         NA                2                      NA
#14  c            4        NA            47         NA                3                      NA
#15  c            5        87            44         NA                4                      NA

Testing against desired data provided:

all.equal(das[,"clinic_number_extracted"], das_final[,"clinic_number_extracted"])

#[1] TRUE
  • Related