Checking and reviewing previous grouped values within groups in R-CodePudding

Hello Everyone I hope you guys are having a great week,

I have a small dataset of 4 variables one is subject the second is key which is a code that a subject uses to log onto a system, third is order which will keep track of chronological year and finally the variable Period that indicates whether the key was used in a previous time "past" or the current month "current"

This is the dataset:

subjects<-c(rep("James",3),
            rep("Alex",2),
            rep("Mila",8),
            rep("Mark",1))

keys<-c(rep("IX08-8",2),"IX08-8",
        "UX-007","HH-011",rep("PO_85",7),"UJ_8","785_PO")
order<-c(1:14)
period<-c("past","past","current","past","current",rep("past",6),"current","current","current")
df<-cbind(subjects,keys,period,order)  

> head(df)
     subjects keys     period    order
[1,] "James"  "IX08-8" "past"    "1"  
[2,] "James"  "IX08-8" "past"    "2"  
[3,] "James"  "IX08-8" "current" "3"  
[4,] "Alex"   "UX-007" "past"    "4"  
[5,] "Alex"   "HH-011" "current" "5"  
[6,] "Mila"   "PO_85"  "past"    "6"

ultimately I have to be able to tell if a subject is using a previously used key to log onto the system in the current Period, if a subject uses a new key to log onto the system in the current period then I will assign the value "1" to a column named "result", if the user did not use a previously used key to log onto the system during the current period the value assigned should be "0" and "NA" otherwise

My desired output would look like this:

      subjects keys     period    order result
 [1,] "James"  "IX08-8" "past"    "1"   NA    
 [2,] "James"  "IX08-8" "past"    "2"   NA    
 [3,] "James"  "IX08-8" "current" "3"   "0"   
 [4,] "Alex"   "UX-007" "past"    "4"   NA    
 [5,] "Alex"   "HH-011" "current" "5"   "1"   
 [6,] "Mila"   "PO_85"  "past"    "6"   NA    
 [7,] "Mila"   "PO_85"  "past"    "7"   NA    
 [8,] "Mila"   "PO_85"  "past"    "8"   NA    
 [9,] "Mila"   "PO_85"  "past"    "9"   NA    
[10,] "Mila"   "PO_85"  "past"    "10"  NA    
[11,] "Mila"   "PO_85"  "past"    "11"  NA    
[12,] "Mila"   "PO_85"  "current" "12"  "0"   
[13,] "Mila"   "UJ_8"   "current" "13"  "1"   
[14,] "Mark"   "785_PO" "current" "14"  "1"

for instance, in row # 3 James has assigned value 0 in result because he has used a previously used key to log on to the system in the current month which was key "IX08-8", but Mark has a value of 1 in the result column because the system has only tracked one key and that happens to be the key that he used to log on this current period which is technically a "new key"

What have I done to solve this?

I can group the dataset by subject and make sure this is arrange by order in a descending manner, but I can only think about creating a vector of keys (vector.of.previous.keys) per subject based on (period="past") and then evaluate if the current key is %in% vector.of.previous.keys, but if there is a way to only check whithin groups this criteria it would be a lot more efficient thank you so much guys for your help

CodePudding user response：

Another dplyr method here, where all of your conditions are coded in a case_when statement.

Code

library(dplyr)

df %>% 
  group_by(subjects) %>% 
  mutate(result = case_when(period == "current" & n() == 1 ~ "1",
                            period == "current" & keys == first(keys) ~ "0",
                            period == "current" & keys != first(keys) & n() > 1 ~ "1",
                            period == "past" ~ NA_character_,
                            TRUE == "past" ~ NA_character_))
# A tibble: 14 × 5
# Groups:   subjects [4]
   subjects keys   period  order result
   <chr>    <chr>  <chr>   <int> <chr> 
 1 James    IX08-8 past        1 NA    
 2 James    IX08-8 past        2 NA    
 3 James    IX08-8 current     3 0     
 4 Alex     UX-007 past        4 NA    
 5 Alex     HH-011 current     5 1     
 6 Mila     PO_85  past        6 NA    
 7 Mila     PO_85  past        7 NA    
 8 Mila     PO_85  past        8 NA    
 9 Mila     PO_85  past        9 NA    
10 Mila     PO_85  past       10 NA    
11 Mila     PO_85  past       11 NA    
12 Mila     PO_85  current    12 0     
13 Mila     UJ_8   current    13 1     
14 Mark     785_PO current    14 1

data

Note that I've changed your cbind() to data.frame (data frame is easier to work with compared to matrix).

subjects<-c(rep("James",3),
            rep("Alex",2),
            rep("Mila",8),
            rep("Mark",1))

keys<-c(rep("IX08-8",2),"IX08-8",
        "UX-007","HH-011",rep("PO_85",7),"UJ_8","785_PO")
order<-c(1:14)
period<-c("past","past","current","past","current",rep("past",6),"current","current","current")
df<-data.frame(subjects,keys,period,order)

CodePudding user response：

Assuming your data is stored in a data.frame

df <- data.frame(subjects,keys,period,order)

you could use

library(dplyr)

df %>% 
  group_by(subjects, keys) %>% 
  mutate(count = row_number()) %>% 
  group_by(subjects) %>% 
  mutate(result = case_when(period == "current" & count == 1 ~ 1,
                            period == "current" & count >= 1 ~ 0,
                            TRUE ~ NA_real_)) %>% 
  ungroup() %>% 
  select(-count)

to get

# A tibble: 14 x 5
   subjects keys   period  order result
   <chr>    <chr>  <chr>   <int>  <dbl>
 1 James    IX08-8 past        1     NA
 2 James    IX08-8 past        2     NA
 3 James    IX08-8 current     3      0
 4 Alex     UX-007 past        4     NA
 5 Alex     HH-011 current     5      1
 6 Mila     PO_85  past        6     NA
 7 Mila     PO_85  past        7     NA
 8 Mila     PO_85  past        8     NA
 9 Mila     PO_85  past        9     NA
10 Mila     PO_85  past       10     NA
11 Mila     PO_85  past       11     NA
12 Mila     PO_85  current    12      0
13 Mila     UJ_8   current    13      1
14 Mark     785_PO current    14      1