How to detect multivalued observations for each ID in a dataset?

I have a dataset that contains 3 variables, like this:

id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3
...

Notice that for id == "b2" at phase == 2, the gender is accidentally recorded as "f". It should be "m", consistent with the other phases, because gender cannot change during the study. How can I write R code to detect which ids have this issue? Thanks a lot!

CodePudding user response：

With dplyr, you can detect which ids have more than one gender using n_distinct():

library(dplyr)

df %>%
  group_by(id) %>%
  filter(n_distinct(gender) > 1) %>%
  ungroup()

# # A tibble: 3 × 3
#   id    gender phase
#   <chr> <chr>  <int>
# 1 b2    m          1
# 2 b2    f          2
# 3 b2    m          3
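
If you only want the list of affected ids rather than all of their rows, the same idea can be collapsed to one row per id. This is a minimal sketch, assuming the data frame is called df and has the columns shown in the question; the names n_gender, genders and inconsistent are made up for illustration.

```r
library(dplyr)

# Example data copied from the question
df <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# One row per id: the number of distinct genders and which values occur.
# n_gender, genders and inconsistent are illustrative names, not from the answer.
inconsistent <- df %>%
  group_by(id) %>%
  summarise(
    n_gender = n_distinct(gender),
    genders  = paste(sort(unique(gender)), collapse = ", ")
  ) %>%
  filter(n_gender > 1)

inconsistent
```

Only the ids with more than one distinct gender are left, together with the conflicting values, which is handy when the data has many rows per id.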

CodePudding user response：

You can use lag() to check whether the value in the column changes within each id, and keep only the ids that have a change, like this:

df <- read.table(text="id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3", header = TRUE)

library(dplyr)
df %>%
  group_by(id) %>%
  filter(any(gender != lag(gender)))
#> # A tibble: 3 × 3
#> # Groups:   id [1]
#>   id    gender phase
#>   <chr> <chr>  <int>
#> 1 b2    m          1
#> 2 b2    f          2
#> 3 b2    m          3

Created on 2022-07-13 by the reprex package (v2.0.1)
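
To see why the lag() filter works, it may help to look at the comparison as a column of its own. This is a rough sketch on the same example data; the names changed and flagged are illustrative only. Within each id, the first row is compared with a missing previous value, so it comes out as NA. For an id with no change, any() then returns NA rather than FALSE, and filter() drops rows whose condition is NA, so those ids are excluded.

```r
library(dplyr)

# Example data copied from the question
df <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Compare each row with the previous row of the same id.
# 'changed' is an illustrative column name: NA on the first row of each id,
# TRUE where gender differs from the row before, FALSE otherwise.
flagged <- df %>%
  group_by(id) %>%
  mutate(changed = gender != lag(gender)) %>%
  ungroup()

flagged
```

Note that the comparison depends on row order, so the TRUE values mark where a switch happens, not which value is the wrong one.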

CodePudding user response：

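# Rebuild the example data from the question
id <- c("a1", "a1", "a1", "b2", "b2", "b2", "c3", "c3", "c3")
gender <- c("m", "m", "m", "m", "f", "m", "f", "f", "f")
phase <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
mydata <- data.frame(id, gender, phase)
# Overwrite gender with the known correct value for each id (ids are hard-coded)
mydata[mydata$id %in% c("a1", "b2"), "gender"] <- "m"
mydata[mydata$id %in% c("c3"), "gender"] <- "f"
mydata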
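
Hard-coding the ids gets tedious when there are many of them. One possible alternative is to replace each id's gender with the value that occurs most often for that id. This is only a sketch: it assumes the most frequent value is the correct one, which the data itself cannot guarantee, and most_common is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

# Example data copied from the question
mydata <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Hypothetical helper: the most frequent value in x.
# On a tie, the value that sorts first in table() wins.
most_common <- function(x) names(which.max(table(x)))

# Assumption: the majority value within each id is the right one
fixed <- mydata %>%
  group_by(id) %>%
  mutate(gender = most_common(gender)) %>%
  ungroup()

fixed
```

It is probably worth running one of the detection snippets first and checking the flagged ids by hand before overwriting anything, especially where the values are tied.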