How to Identify the Students with Grade Retention in the Dataset Using R or Stata?-CodePudding

I have a small dataset in which there are 4 data records about student academic progression. The fist column is id, and the second column is grade. My goal is to create a new variable called "GR" (Grade Retention) to identify the students with grade retention. If it is the case, then GR=1, otherwise, GR=0. The resulting dataset should look like below,

structure(list(id = c(1000, 1000, 1000, 1000, 1000, 1000, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002, 
1002, 1002, 1002, 1002, 1003, 1003, 1003, 1003, 1003, 1003), 
    grade = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4", 
    "5", "5", "6", "1", "2", "2", "3", "4", "4", "4", "5", "6", 
    "1", "2", "3", "4", "5", "6"), GR= c("0", "0", 
    "0", "0", "0", "0", "1", "1", "1", "1", "1", "1", "1", "1", 
    "1", "1", "1", "1", "1", "1", "1", "1", "0", "0", "0", "0", 
    "0", "0")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-28L))->df_new

The original data set is structured as follows,

structure(list(id = c(1000, 1000, 1000, 1000, 1000, 1000, 1001, 
1001, 1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002, 
1002, 1002, 1002, 1002, 1003, 1003, 1003, 1003, 1003, 1003), 
    grade = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4", 
    "5", "5", "6", "1", "2", "2", "3", "4", "4", "4", "5", "6", 
    "1", "2", "3", "4", "5", "6")), row.names = c(NA, -28L), class = c("tbl_df", 
"tbl", "data.frame"))->df

Appreciate your kindly help or guidance!

CodePudding user response：

You can check in each group if contains any duplicated grade:

library(tidyverse)

df %>%
  group_by(id) %>%
  mutate(GR =  any(duplicated(grade)))

Output

      id grade    GR
   <dbl> <chr> <int>
 1  1000 1         0
 2  1000 2         0
 3  1000 3         0
 4  1000 4         0
 5  1000 5         0
 6  1000 6         0
 7  1001 1         1
 8  1001 2         1
 9  1001 3         1
10  1001 4         1
# … with 18 more rows

CodePudding user response：

The question isn't posed in a friendly way if you want a Stata solution too, but here is one way to do it. The idea is just to get an indicator for any repeated grade and then to spread it to other observations (records, rows) for that identifier.

* Example generated by -dataex-. For more info, type help dataex
clear
input int id byte grade float GR
1000 1 0
1000 2 0
1000 3 0
1000 4 0
1000 5 0
1000 6 0
1001 1 1
1001 2 1
1001 3 1
1001 4 1
1001 5 1
1001 5 1
1001 6 1
1002 1 1
1002 2 1
1002 2 1
1002 3 1
1002 4 1
1002 4 1
1002 4 1
1002 5 1
1002 6 1
1003 1 0
1003 2 0
1003 3 0
1003 4 0
1003 5 0
1003 6 0
end

bysort id grade : gen wanted = _N > 1

bysort id (wanted) : replace wanted = wanted[_N]

sort id grade

list, sepby(id)

      ---------------------------- 
     |   id   grade   GR   wanted |
     |----------------------------|
  1. | 1000       1    0        0 |
  2. | 1000       2    0        0 |
  3. | 1000       3    0        0 |
  4. | 1000       4    0        0 |
  5. | 1000       5    0        0 |
  6. | 1000       6    0        0 |
     |----------------------------|
  7. | 1001       1    1        1 |
  8. | 1001       2    1        1 |
  9. | 1001       3    1        1 |
 10. | 1001       4    1        1 |
 11. | 1001       5    1        1 |
 12. | 1001       5    1        1 |
 13. | 1001       6    1        1 |
     |----------------------------|
 14. | 1002       1    1        1 |
 15. | 1002       2    1        1 |
 16. | 1002       2    1        1 |
 17. | 1002       3    1        1 |
 18. | 1002       4    1        1 |
 19. | 1002       4    1        1 |
 20. | 1002       4    1        1 |
 21. | 1002       5    1        1 |
 22. | 1002       6    1        1 |
     |----------------------------|
 23. | 1003       1    0        0 |
 24. | 1003       2    0        0 |
 25. | 1003       3    0        0 |
 26. | 1003       4    0        0 |
 27. | 1003       5    0        0 |
 28. | 1003       6    0        0 |
      ----------------------------