I have a large data frame in long format that I want to convert to wide format at a later stage. Each StudyId has entries from several annotators. I want to filter the data frame so that each StudyId keeps the entries of only one annotator, preferably following a hierarchy of annotators: if the first AnnotatorId (from some list) is present, keep its entries; if not, look for the second in line, and so forth.
Here is some sample code to replicate:
StudyId <- c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")
AnnotatorId <- c("Frank", "Frank", "Steffi", "Steffi", "Steffi", "Max", "Max", "Toni", "Frank", "Frank", "Annabelle", "Annabelle")
a <- data.frame(StudyId, AnnotatorId)
The data frame consists of approximately 160 variables and more than 3000 observations. The IDs are simplified in this example; in my data frame they are a mixture of digits and letters, like "034e6cee-79e8-4e67-a27a-1ee2c187eaf4", so sorting alphabetically or by highest value most likely won't help.
So far, I have tried ordering the levels of AnnotatorId, but I do not know how to loop over all entries and keep, per StudyId, only the AnnotatorId that comes first in the factor's level order.
a$AnnotatorId <- factor(a$AnnotatorId,
levels = c(
"Max",
"Annabelle",
"Toni",
"Steffi",
"Frank"
), ordered = TRUE)
In the end I would like to have something like:
StudyId | AnnotatorId |
---|---|
a | Steffi |
a | Steffi |
a | Steffi |
b | Max |
b | Max |
c | Annabelle |
c | Annabelle |
I am a programming novice, so any help and kind guidance is highly appreciated.
CodePudding user response:
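You can group_by StudyId and filter. The levels are reversed here so that the preferred annotator is the largest value of the ordered factor, which max() then picks within each group: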
StudyId <- c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")
AnnotatorId <- c("Frank", "Frank", "Steffi", "Steffi", "Steffi", "Max", "Max", "Toni", "Frank", "Frank", "Annabelle", "Annabelle")
a <- data.frame(StudyId, AnnotatorId)
a$AnnotatorId <- factor(a$AnnotatorId,
levels = rev(c(
"Max",
"Annabelle",
"Toni",
"Steffi",
"Frank"
)), ordered = TRUE)
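library(dplyr)  # provides %>%, group_by() and filter()
a %>%
  group_by(StudyId) %>%
  filter(AnnotatorId == max(AnnotatorId))  # keep rows of the top-ranked annotator per study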
Output:
# A tibble: 7 × 2
# Groups: StudyId [3]
StudyId AnnotatorId
<chr> <ord>
1 a Steffi
2 a Steffi
3 a Steffi
4 b Max
5 b Max
6 c Annabelle
7 c Annabelle
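If you would rather not convert the column to a factor at all, the same idea can be sketched with match(), which returns each annotator's position in the hierarchy vector. This is only a sketch under some assumptions: the names hier, priority and res are made up for illustration, and annotators missing from hier get NA and are dropped. A StudyId whose annotators are all missing from hier would be dropped entirely, with a warning from min().

```r
library(dplyr)

# Sample data from the question
a <- data.frame(
  StudyId = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"),
  AnnotatorId = c("Frank", "Frank", "Steffi", "Steffi", "Steffi", "Max", "Max",
                  "Toni", "Frank", "Frank", "Annabelle", "Annabelle")
)

# Hierarchy: an earlier position means a higher priority
hier <- c("Max", "Annabelle", "Toni", "Steffi", "Frank")

res <- a %>%
  # position of each annotator in the hierarchy (NA if not listed)
  mutate(priority = match(AnnotatorId, hier)) %>%
  group_by(StudyId) %>%
  # keep only the rows of the best-ranked annotator present in each study
  filter(priority == min(priority, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-priority)
```

Because match() compares plain strings, this should work the same way when the IDs are long strings such as "034e6cee-79e8-4e67-a27a-1ee2c187eaf4"; only the hier vector has to list them in the desired order.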
CodePudding user response:
I would advise against ordering nominal values on an ordinal scale. An ordinary factor is sufficient, since its underlying integer structure is already "ordered" according to the levels, starting with 1. Just use min in ave.
hier <- c("Max", "Annabelle", "Toni", "Steffi", "Frank")
a <- transform(a, AnnotatorId=factor(AnnotatorId, levels = hier))
a[as.logical(ave(as.integer(a$AnnotatorId), a$StudyId, FUN=\(x) x == min(x))), ]
# StudyId AnnotatorId
# 3 a Steffi
# 4 a Steffi
# 5 a Steffi
# 6 b Max
# 7 b Max
# 11 c Annabelle
# 12 c Annabelle
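To make the "underlying integer structure" more concrete, here is a small base R sketch that looks at the integer codes directly and then does the same filtering with match() instead of a factor. The names codes, priority, best and res are illustrative only, and the sketch assumes every annotator appears in hier.

```r
# Sample data from the question
a <- data.frame(
  StudyId = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"),
  AnnotatorId = c("Frank", "Frank", "Steffi", "Steffi", "Steffi", "Max", "Max",
                  "Toni", "Frank", "Frank", "Annabelle", "Annabelle")
)
hier <- c("Max", "Annabelle", "Toni", "Steffi", "Frank")

# The integer codes of a factor follow the order of its levels:
# "Frank" is the 5th level and "Max" the 1st
codes <- as.integer(factor(c("Frank", "Max"), levels = hier))

# match() gives the same positions without creating a factor
priority <- match(a$AnnotatorId, hier)

# Smallest position (= highest priority) within each StudyId
best <- ave(priority, a$StudyId, FUN = min)

# Keep the rows whose annotator is the best-ranked one in its study
res <- a[priority == best, ]
```

With the sample data, res should contain the same seven rows as above (original rows 3 to 7, 11 and 12).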