Recoding a complex composite score in R-CodePudding

Assume my research concerns an observational longitudinal cohort study.

Let γ_comp be the composite outcome of interest and γ1....γ4 at time t1 and t2 denote components of γ_comp. In addition, the dataset has three other variables (χ1, χ2, and χ3) which will be used in future analysis but are not necessary to code γ_comp. Here is an excerpt of the data.frame

df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 
                    Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4), 
                    Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7), 
                    Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6), 
                    Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6), 
                    Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2), 
                    Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8), 
                    Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4), 
                    Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4), 
                    X1 = c(40, 45, 52, 44, 42, 65, 55, 61, 52, 49), 
                    X2 = c("NL", "UK", "NL", "US", "UK", "US", "NL", "NL", "UK", "UK"), 
                    X3 = c(2000, 2005, 2003, 2000, 2001, 2002, 2003, 2004, 2001, 2000)), 
                    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L))

Structure

spec_tbl_df [10 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID   : num [1:10] 1 2 3 4 5 6 7 8 9 10
 $ Y1_t1: num [1:10] 5 6 10 7 5 7 5 4 7 4
 $ Y2_t1: num [1:10] 6 4 8 8 7 10 7 6 5 7
 $ Y3_t1: num [1:10] 5 6 10 4 8 5 10 5 4 6
 $ Y4_t1: num [1:10] 4.5 8.5 9.5 4.5 5 8 4.5 8.5 4 6
 $ Y1_t2: num [1:10] 6 4 5 5 3 4 8 4 3 2
 $ Y2_t2: num [1:10] 5 4 3 6 5 5 5 2 2 8
 $ Y3_t2: num [1:10] 2 2 4 5 4 9 5 3 2 4
 $ Y4_t2: num [1:10] 3.5 6 5 5 4.5 4 2.5 7 4.5 4
 $ X1   : num [1:10] 40 45 52 44 42 65 55 61 52 49
 $ X2   : chr [1:10] "NL" "UK" "NL" "US" ...
 $ X3   : num [1:10] 2000 2005 2003 2000 2001 ...

As mentioned earlier, I am interested in calculating γ_comp. The rules for recording are as follows:

3 out of 4 components (i.e., γ1....γ4 must have more than 20% improvement (i.e. decrease) on numeric scale (0 - 10) [higher is worse] at t2 compared to t1).
In the "remaining component," there should be no worsening of more than 20% at t2 compared to t1

I believe the following steps have to be taken to achieve this aim. First, Y1_diff = Y1_t2/Y1_t1 must be calculated for every component. This is the proportion between two time points and should be <0.80. Next, an if_else condition has to be applied, which reinforces these rules and returns 1 if the rules are met and 0 if not (i.e., "responded" to treatment or not).

For example, this could be a desired output:

      ID Ycomp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 Y1_diff Y2_diff Y3_diff Y4_diff    X1 X2       X3
 1     1     0     5     6     5   4.5     6     5     2   3.5    1.2     0.83    0.4     0.78    40 NL     2000
 2     2     1     6     4     6   8.5     4     4     2   6      0.67    1       0.33    0.71    45 UK     2005
 3     3     1    10     8    10   9.5     5     3     4   5      0.5     0.38    0.4     0.53    52 NL     2003
 4     4     0     7     8     4   4.5     5     6     5   5      0.71    0.75    1.25    1.11    44 US     2000
 5     5     1     5     7     8   5       3     5     4   4.5    0.6     0.71    0.5     0.9     42 UK     2001
 6     6     0     7    10     5   8       4     5     9   4      0.57    0.5     1.8     0.5     65 US     2002
 7     7     0     5     7    10   4.5     8     5     5   2.5    1.6     0.71    0.5     0.56    55 NL     2003
 8     8     0     4     6     5   8.5     4     2     3   7      1       0.33    0.6     0.82    61 NL     2004
 9     9     1     7     5     4   4       3     2     2   4.5    0.43    0.4     0.5     1.13    52 UK     2001
10    10     1     4     7     6   6       2     8     4   4      0.5     1.14    0.67    0.67    49 UK     2000

I would appreciate any advice on recoding the composite score γ_comp. Alternative methods are also welcome. The idea is to use γ_comp in logistic regression in future analysis.

CodePudding user response：

This should do it for you, assuming my understanding is correct:

inner_join(
  df, 
  df %>%
    select(ID,starts_with("Y")) %>% 
    pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
    pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
    mutate(change=1-t2/t1) %>% 
    group_by(ID) %>% 
    mutate(impct = sum(change>0.2)) %>% 
    summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
) %>% relocate(Y_comp,.after = ID)

Output:

      ID Y_comp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2    X1 X2       X3
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
 1     1      0     5     6     5   4.5     6     5     2   3.5    40 NL     2000
 2     2      1     6     4     6   8.5     4     4     2   6      45 UK     2005
 3     3      1    10     8    10   9.5     5     3     4   5      52 NL     2003
 4     4      0     7     8     4   4.5     5     6     5   5      44 US     2000
 5     5      1     5     7     8   5       3     5     4   4.5    42 UK     2001
 6     6      0     7    10     5   8       4     5     9   4      65 US     2002
 7     7      0     5     7    10   4.5     8     5     5   2.5    55 NL     2003
 8     8      0     4     6     5   8.5     4     2     3   7      61 NL     2004
 9     9      1     7     5     4   4       3     2     2   4.5    52 UK     2001
10    10      1     4     7     6   6       2     8     4   4      49 UK     2000

Explanation:

This is an inner join between df, and a new dataframe that contains two columns ID and Y_comp. How is this second frame created?

I select the columns ID and those starting with "Y"
I pivot long, and the pivot wide to get the data into a format with four columns (ID, Y, t1, and t2).
On each row, I estimate the change as 1-t2/t1.
For each ID (group_by(ID)), I generate a column impt as the number of times change exceeds 0.2. This is constant over ID
For each ID, I define Y_comp as TRUE if all of the rows have impct==4 (i.e. all are improvements) OR, if three are improvements and the minimum in the set is not less than negative 0.2).
I multiply by 1 in that same line, to convert Y_comp to numeric 1/0, rather than T/F
After the join is completed, I move Y_comp after ID, using relocate()

Update:

The OP is having an error, likely caused by namespace collision; one solution is to be specific about the packages being used:

library(magrittr)
dplyr::inner_join(
  df, 
  df %>%
    dplyr::select(ID,starts_with("Y")) %>% 
    tidyr::pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
    tidyr::pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
    dplyr::mutate(change=1-t2/t1) %>% 
    dplyr::group_by(ID) %>% 
    dplyr::mutate(impct = sum(change>0.2)) %>% 
    dplyr::summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
) %>% dplyr::relocate(Y_comp,.after = ID)

CodePudding user response：

Insipired by the method of langtang, I found one possible solution to the problem:

df <- df %>% mutate(Y1_diff = 
                case_when( Y1_t2/ Y1_t1 < 0.8 ~ 1,
                           Y1_t2 == 0 ~ 0,
                           Y1_t2/ Y1_t1 >= 0.8 & Y1_t2/ Y1_t1 <=1.2 ~ 0, 
                           TRUE ~ -1)) %>%
  mutate(Y2_diff = 
           case_when( Y2_t2/ Y2_t1 < 0.8 ~ 1,
                      Y2_t2 == 0 ~ 0,
                      Y2_t2/ Y2_t1 >= 0.8 & Y2_t2/ Y2_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Y3_diff = 
           case_when( Y3_t2/ Y3_t1 < 0.8 ~ 1,
                      Y3_t2 == 0 ~ 0,
                      Y3_t2/ Y3_t1 >= 0.8 & Y3_t2/ Y3_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Y4_diff = 
           case_when( Y4_t2/ Y4_t1 < 0.8 ~ 1,
                      Y4_t2 == 0 ~ 0,
                      Y4_t2/ Y4_t1 >= 0.8 & Y4_t2/ Y4_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Ycomp = 
           case_when(Y1_diff Y2_diff Y3_diff Y4_diff >=3 ~ 1,
                     TRUE ~ 0))

Explanation

I am creating four variables first, which assess whether the relative difference was less than 0.8 (i.e., 20% improved), between 0.8-1.2, or worsened and was more than >1.2. In the case of improvement, these between variables (Yn_diff) were coded 1, 0 if in between, and -1 if worsened. I also looked if, at time t2, the variable output was zero and gave it a score of 0 because, in my real dataset, there were scenario's where both t1 and t2 were 0, which gives NaaN error. Finally, I added up all the variables, which gives the correct output in the variable Ycomp.

Output

      ID Ycomp Y1_t1 Y1_t2 Y2_t1 Y2_t2 Y3_t1 Y3_t2 Y4_t1 Y4_t2
 1     1     0     5     6     6     5     5     2   4.5   3.5
 2     2     1     6     4     4     4     6     2   8.5   6  
 3     3     1    10     5     8     3    10     4   9.5   5  
 4     4     0     7     5     8     6     4     5   4.5   5  
 5     5     1     5     3     7     5     8     4   5     4.5
 6     6     0     7     4    10     5     5     9   8     4  
 7     7     0     5     8     7     5    10     5   4.5   2.5
 8     8     0     4     4     6     2     5     3   8.5   7  
 9     9     1     7     3     5     2     4     2   4     4.5
10    10     1     4     2     7     8     6     4   6     4