I have a dataframe that currently looks like this:
subjectID | Trial | Score |
---|---|---|
1 | 1 | 16 |
1 | 1 | 16 |
1 | 1 | 16 |
1 | 2 | 8 |
1 | 2 | 8 |
1 | 2 | 8 |
1 | 3 | 12 |
1 | 3 | 12 |
1 | 3 | 12 |
2 | 1 | 9 |
2 | 1 | 9 |
2 | 1 | 9 |
2 | 2 | 10 |
2 | 2 | 10 |
2 | 2 | 10 |
I need to create a new column, Previous_Trial_Score, that is simply the score on the previous trial for each person. For example:
subjectID | Trial | Score | Previous_Trial_Score |
---|---|---|---|
1 | 1 | 16 | NA |
1 | 1 | 16 | NA |
1 | 1 | 16 | NA |
1 | 2 | 8 | 16 |
1 | 2 | 8 | 16 |
1 | 2 | 8 | 16 |
1 | 3 | 12 | 8 |
1 | 3 | 12 | 8 |
1 | 3 | 12 | 8 |
2 | 1 | 9 | NA |
2 | 1 | 9 | NA |
2 | 1 | 9 | NA |
2 | 2 | 10 | 9 |
2 | 2 | 10 | 9 |
2 | 2 | 10 | 9 |
And so on. Trial 1 for each subject will always be NA, as there is no previous trial for that person. I am writing a for-loop to accomplish this, below:
for (myperson in unique(data$subjectID)){
  for (mytrial in unique(data$Trial[data$Trial>1])){
    #Specify the trial and person
    Prev_Score=as.numeric(unique(data[data$subjectID==myperson & data$Trial==mytrial-1, "Score"]))
    #Save it to the dataframe
    data[data$subjectID==myperson & data$Trial==mytrial,"Prev_Score"]=Prev_Score
  }
}
In the above loop, I had to specify as.numeric and unique to get R to return a single value properly. However, when I run the loop, I get this error:
Error: Assigned data `value` must be compatible with existing data.
i Error occurred for column `Prev_Score`.
x Can't convert from <double> to <logical> due to loss of precision.
* Locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 3...
Does anyone have a solution? I am open to tidyverse/dplyr work-arounds.
Answer:
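We could group by subjectID and Trial, summarise to one score per trial, take the lag of that score within each subject, and then join the result back to the original data: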
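library(dplyr)
df1 %>%
  # one row per subject and trial, keeping that trial's (last) score
  group_by(subjectID, Trial) %>%
  summarise(Previous_Trial_Score = last(Score), .groups = 'drop_last') %>%
  # still grouped by subjectID here, so lag() shifts within each subject
  mutate(Previous_Trial_Score = lag(Previous_Trial_Score)) %>%
  # attach the lagged score to every original row
  left_join(df1, ., by = c("subjectID", "Trial"))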
Output
subjectID Trial Score Previous_Trial_Score
1 1 1 16 NA
2 1 1 16 NA
3 1 1 16 NA
4 1 2 8 16
5 1 2 8 16
6 1 2 8 16
7 1 3 12 8
8 1 3 12 8
9 1 3 12 8
10 2 1 9 NA
11 2 1 9 NA
12 2 1 9 NA
13 2 2 10 9
14 2 2 10 9
15 2 2 10 9
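Or a slightly more compact option with data.table:
library(data.table)
# Per subject: last Score of each Trial, shifted by one, then indexed by Trial
# (the [Trial] indexing assumes trials are numbered 1, 2, 3, ... within each subject)
setDT(df1)[, Previous_Trial_Score := shift(.SD[, last(Score), Trial]$V1)[Trial],
           subjectID]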
Output
> df1
subjectID Trial Score Previous_Trial_Score
<int> <int> <int> <int>
1: 1 1 16 NA
2: 1 1 16 NA
3: 1 1 16 NA
4: 1 2 8 16
5: 1 2 8 16
6: 1 2 8 16
7: 1 3 12 8
8: 1 3 12 8
9: 1 3 12 8
10: 2 1 9 NA
11: 2 1 9 NA
12: 2 1 9 NA
13: 2 2 10 9
14: 2 2 10 9
15: 2 2 10 9
Data
df1 <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), Score = c(16L, 16L, 16L,
8L, 8L, 8L, 12L, 12L, 12L, 9L, 9L, 9L, 10L, 10L, 10L)),
class = "data.frame", row.names = c(NA,
-15L))
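For comparison, here is a minimal base R sketch that follows the logic of the original loop (look up the score for Trial - 1 of the same subject) without any packages. The helper names key and prev_key are made up for illustration, and the sketch assumes, as in the sample data, that Score is constant within each subjectID/Trial combination.

```r
# Same sample data as above, built compactly.
df1 <- data.frame(
  subjectID = rep(c(1L, 2L), times = c(9L, 6L)),
  Trial     = c(rep(1:3, each = 3), rep(1:2, each = 3)),
  Score     = c(rep(c(16L, 8L, 12L), each = 3), rep(c(9L, 10L), each = 3))
)

# One key per row, and one key for the trial immediately before it.
key      <- paste(df1$subjectID, df1$Trial)
prev_key <- paste(df1$subjectID, df1$Trial - 1L)

# match() returns the first row whose key equals prev_key, or NA if there is
# none, so Trial 1 rows end up with NA.
df1$Previous_Trial_Score <- df1$Score[match(prev_key, key)]
```

Unlike lag(), this looks up exactly Trial - 1, so a subject with a missing trial number would get NA for the following trial rather than the score of the nearest earlier trial. Which behaviour is wanted depends on the data.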
Answer:
Here is another dplyr option, where we just keep the unique rows with distinct, then get the lag value of Score within each subject, and join back to the original dataframe. However, if you have more columns in your dataframe, then @akrun's answer handles the additional columns more cleanly.
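library(dplyr)
df %>%
  # one row per unique subjectID/Trial/Score combination
  distinct() %>%
  group_by(subjectID) %>%
  # previous trial's score within each subject (relies on rows being in trial order)
  mutate(Previous_Trial_Score = lag(Score)) %>%
  left_join(df, ., by = c("subjectID", "Trial", "Score"))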
Output
subjectID Trial Score Previous_Trial_Score
1 1 1 16 NA
2 1 1 16 NA
3 1 1 16 NA
4 1 2 8 16
5 1 2 8 16
6 1 2 8 16
7 1 3 12 8
8 1 3 12 8
9 1 3 12 8
10 2 1 9 NA
11 2 1 9 NA
12 2 1 9 NA
13 2 2 10 9
14 2 2 10 9
15 2 2 10 9
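If the dataframe has additional columns, one possible adaptation is to pass only the relevant columns to distinct and to join by subjectID and Trial. This is only a sketch: the Item column below is a hypothetical extra column added for illustration, and the arrange step is there because lag depends on row order.

```r
library(dplyr)

# Sample data plus a hypothetical extra column (Item) that varies within a trial.
df <- data.frame(
  subjectID = rep(c(1L, 2L), times = c(9L, 6L)),
  Trial     = c(rep(1:3, each = 3), rep(1:2, each = 3)),
  Score     = c(rep(c(16L, 8L, 12L), each = 3), rep(c(9L, 10L), each = 3)),
  Item      = rep(1:3, times = 5)
)

# One row per subject and trial, with the previous trial's score.
prev <- df %>%
  distinct(subjectID, Trial, Score) %>%
  arrange(subjectID, Trial) %>%
  group_by(subjectID) %>%
  mutate(Previous_Trial_Score = lag(Score)) %>%
  ungroup() %>%
  select(-Score)

# Attach the lagged score to every original row.
result <- left_join(df, prev, by = c("subjectID", "Trial"))
```

Restricting distinct to the three key columns keeps the lookup table at one row per trial even when other columns differ from row to row.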
Data
df <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), Score = c(16L, 16L, 16L,
8L, 8L, 8L, 12L, 12L, 12L, 9L, 9L, 9L, 10L, 10L, 10L)), class = "data.frame", row.names = c(NA,
-15L))
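As for the error in the original loop: the message says a double cannot be converted to logical, which suggests that Prev_Score already existed as a logical column (for example, created with a plain NA) in a tibble, which is strict about column types on assignment. The question does not show how the column was created, so this is an assumption. Under that assumption, a minimal sketch of a fix is to initialise the column with NA_real_ before the loop. It is shown here on a plain data.frame with the sample data; the names prev and target are illustrative.

```r
# Sample data from the question, with Score stored as numeric.
data <- data.frame(
  subjectID = rep(c(1L, 2L), times = c(9L, 6L)),
  Trial     = c(rep(1:3, each = 3), rep(1:2, each = 3)),
  Score     = c(rep(c(16, 8, 12), each = 3), rep(c(9, 10), each = 3))
)

# Initialise the column as numeric NA rather than logical NA.
data$Prev_Score <- NA_real_

for (myperson in unique(data$subjectID)) {
  for (mytrial in unique(data$Trial[data$Trial > 1])) {
    # Score of the same person on the previous trial
    prev <- unique(data$Score[data$subjectID == myperson & data$Trial == mytrial - 1])
    # Rows of this person on the current trial
    target <- data$subjectID == myperson & data$Trial == mytrial
    # Assign only when such rows exist and the previous score is a single value
    if (any(target) && length(prev) == 1) {
      data$Prev_Score[target] <- prev
    }
  }
}
```

The length check also skips person/trial combinations that do not exist, such as Trial 3 for subject 2 in the sample data.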