Assume my research concerns an observational longitudinal cohort study.
Let γ_comp be the composite outcome of interest and γ1....γ4 at time t1 and t2 denote components of γ_comp. In addition, the dataset has three other variables (χ1, χ2, and χ3) which will be used in future analysis but are not necessary to code γ_comp. Here is an excerpt of the data.frame
df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4),
Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7),
Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6),
Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6),
Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2),
Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8),
Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4),
Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4),
X1 = c(40, 45, 52, 44, 42, 65, 55, 61, 52, 49),
X2 = c("NL", "UK", "NL", "US", "UK", "US", "NL", "NL", "UK", "UK"),
X3 = c(2000, 2005, 2003, 2000, 2001, 2002, 2003, 2004, 2001, 2000)),
class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L))
Structure
spec_tbl_df [10 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:10] 1 2 3 4 5 6 7 8 9 10
$ Y1_t1: num [1:10] 5 6 10 7 5 7 5 4 7 4
$ Y2_t1: num [1:10] 6 4 8 8 7 10 7 6 5 7
$ Y3_t1: num [1:10] 5 6 10 4 8 5 10 5 4 6
$ Y4_t1: num [1:10] 4.5 8.5 9.5 4.5 5 8 4.5 8.5 4 6
$ Y1_t2: num [1:10] 6 4 5 5 3 4 8 4 3 2
$ Y2_t2: num [1:10] 5 4 3 6 5 5 5 2 2 8
$ Y3_t2: num [1:10] 2 2 4 5 4 9 5 3 2 4
$ Y4_t2: num [1:10] 3.5 6 5 5 4.5 4 2.5 7 4.5 4
$ X1 : num [1:10] 40 45 52 44 42 65 55 61 52 49
$ X2 : chr [1:10] "NL" "UK" "NL" "US" ...
$ X3 : num [1:10] 2000 2005 2003 2000 2001 ...
As mentioned earlier, I am interested in calculating γ_comp. The rules for recording are as follows:
- 3 out of 4 components (i.e., γ1....γ4 must have more than 20% improvement (i.e. decrease) on numeric scale (0 - 10) [higher is worse] at t2 compared to t1).
- In the "remaining component," there should be no worsening of more than 20% at t2 compared to t1
I believe the following steps have to be taken to achieve this aim. First, Y1_diff = Y1_t2/Y1_t1
must be calculated for every component. This is the proportion between two time points and should be <0.80. Next, an if_else condition
has to be applied, which reinforces these rules and returns 1
if the rules are met and 0
if not (i.e., "responded" to treatment or not).
For example, this could be a desired output:
ID Ycomp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 Y1_diff Y2_diff Y3_diff Y4_diff X1 X2 X3
1 1 0 5 6 5 4.5 6 5 2 3.5 1.2 0.83 0.4 0.78 40 NL 2000
2 2 1 6 4 6 8.5 4 4 2 6 0.67 1 0.33 0.71 45 UK 2005
3 3 1 10 8 10 9.5 5 3 4 5 0.5 0.38 0.4 0.53 52 NL 2003
4 4 0 7 8 4 4.5 5 6 5 5 0.71 0.75 1.25 1.11 44 US 2000
5 5 1 5 7 8 5 3 5 4 4.5 0.6 0.71 0.5 0.9 42 UK 2001
6 6 0 7 10 5 8 4 5 9 4 0.57 0.5 1.8 0.5 65 US 2002
7 7 0 5 7 10 4.5 8 5 5 2.5 1.6 0.71 0.5 0.56 55 NL 2003
8 8 0 4 6 5 8.5 4 2 3 7 1 0.33 0.6 0.82 61 NL 2004
9 9 1 7 5 4 4 3 2 2 4.5 0.43 0.4 0.5 1.13 52 UK 2001
10 10 1 4 7 6 6 2 8 4 4 0.5 1.14 0.67 0.67 49 UK 2000
I would appreciate any advice on recoding the composite score γ_comp. Alternative methods are also welcome. The idea is to use γ_comp in logistic regression in future analysis.
CodePudding user response:
This should do it for you, assuming my understanding is correct:
inner_join(
df,
df %>%
select(ID,starts_with("Y")) %>%
pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>%
pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>%
mutate(change=1-t2/t1) %>%
group_by(ID) %>%
mutate(impct = sum(change>0.2)) %>%
summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2)))
) %>% relocate(Y_comp,.after = ID)
Output:
ID Y_comp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 X1 X2 X3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 5 6 5 4.5 6 5 2 3.5 40 NL 2000
2 2 1 6 4 6 8.5 4 4 2 6 45 UK 2005
3 3 1 10 8 10 9.5 5 3 4 5 52 NL 2003
4 4 0 7 8 4 4.5 5 6 5 5 44 US 2000
5 5 1 5 7 8 5 3 5 4 4.5 42 UK 2001
6 6 0 7 10 5 8 4 5 9 4 65 US 2002
7 7 0 5 7 10 4.5 8 5 5 2.5 55 NL 2003
8 8 0 4 6 5 8.5 4 2 3 7 61 NL 2004
9 9 1 7 5 4 4 3 2 2 4.5 52 UK 2001
10 10 1 4 7 6 6 2 8 4 4 49 UK 2000
Explanation:
This is an inner join between df
, and a new dataframe that contains two columns ID
and Y_comp
. How is this second frame created?
- I select the columns
ID
and those starting with "Y" - I pivot long, and the pivot wide to get the data into a format with four columns (ID, Y, t1, and t2).
- On each row, I estimate the change as 1-t2/t1.
- For each ID (
group_by(ID)
), I generate a columnimpt
as the number of timeschange
exceeds 0.2. This is constant overID
- For each ID, I define
Y_comp
as TRUE if all of the rows haveimpct==4
(i.e. all are improvements) OR, if three are improvements and the minimum in the set is not less than negative 0.2). - I multiply by 1 in that same line, to convert Y_comp to numeric 1/0, rather than T/F
- After the join is completed, I move Y_comp after ID, using
relocate()
Update:
The OP is having an error, likely caused by namespace collision; one solution is to be specific about the packages being used:
library(magrittr)
dplyr::inner_join(
df,
df %>%
dplyr::select(ID,starts_with("Y")) %>%
tidyr::pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>%
tidyr::pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>%
dplyr::mutate(change=1-t2/t1) %>%
dplyr::group_by(ID) %>%
dplyr::mutate(impct = sum(change>0.2)) %>%
dplyr::summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2)))
) %>% dplyr::relocate(Y_comp,.after = ID)
CodePudding user response:
Insipired by the method of langtang, I found one possible solution to the problem:
df <- df %>% mutate(Y1_diff =
case_when( Y1_t2/ Y1_t1 < 0.8 ~ 1,
Y1_t2 == 0 ~ 0,
Y1_t2/ Y1_t1 >= 0.8 & Y1_t2/ Y1_t1 <=1.2 ~ 0,
TRUE ~ -1)) %>%
mutate(Y2_diff =
case_when( Y2_t2/ Y2_t1 < 0.8 ~ 1,
Y2_t2 == 0 ~ 0,
Y2_t2/ Y2_t1 >= 0.8 & Y2_t2/ Y2_t1 <=1.2 ~ 0,
TRUE ~ -1)) %>%
mutate(Y3_diff =
case_when( Y3_t2/ Y3_t1 < 0.8 ~ 1,
Y3_t2 == 0 ~ 0,
Y3_t2/ Y3_t1 >= 0.8 & Y3_t2/ Y3_t1 <=1.2 ~ 0,
TRUE ~ -1)) %>%
mutate(Y4_diff =
case_when( Y4_t2/ Y4_t1 < 0.8 ~ 1,
Y4_t2 == 0 ~ 0,
Y4_t2/ Y4_t1 >= 0.8 & Y4_t2/ Y4_t1 <=1.2 ~ 0,
TRUE ~ -1)) %>%
mutate(Ycomp =
case_when(Y1_diff Y2_diff Y3_diff Y4_diff >=3 ~ 1,
TRUE ~ 0))
Explanation
I am creating four variables first, which assess whether the relative difference was less than 0.8 (i.e., 20% improved), between 0.8-1.2, or worsened and was more than >1.2. In the case of improvement, these between variables (Yn_diff) were coded 1, 0 if in between, and -1 if worsened. I also looked if, at time t2, the variable output was zero and gave it a score of 0 because, in my real dataset, there were scenario's where both t1 and t2 were 0, which gives NaaN error. Finally, I added up all the variables, which gives the correct output in the variable Ycomp.
Output
ID Ycomp Y1_t1 Y1_t2 Y2_t1 Y2_t2 Y3_t1 Y3_t2 Y4_t1 Y4_t2
1 1 0 5 6 6 5 5 2 4.5 3.5
2 2 1 6 4 4 4 6 2 8.5 6
3 3 1 10 5 8 3 10 4 9.5 5
4 4 0 7 5 8 6 4 5 4.5 5
5 5 1 5 3 7 5 8 4 5 4.5
6 6 0 7 4 10 5 5 9 8 4
7 7 0 5 8 7 5 10 5 4.5 2.5
8 8 0 4 4 6 2 5 3 8.5 7
9 9 1 7 3 5 2 4 2 4 4.5
10 10 1 4 2 7 8 6 4 6 4