I have data where I have multiple observations for each ID. I would like to aggregate/collapse the data by ID, however, some of the additional columns have discrepancies (the same ID could be male for one observation and female for another). If all observations agree, I'd like to keep the label, if they're are discrepancies, I'd like to replace it with "missing".
This is what I have
ID <- c("1234", "1234", "1242", "1456", "1234", "1242", "1234", "1234")
MEMBER_GENDER <- c("M", "M","F", "F", "M", "M", "M", "M")
RELATIONSHIP_TYPE_CODE <- c("1","1", "2", "3", "3", "2", "4" ,"3")
data <- data.frame(cbind(ID,MEMBER_GENDER, RELATIONSHIP_TYPE_CODE))
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234 M 1
2 1234 M 1
3 1242 F 2
4 1456 F 3
5 1234 M 3
6 1242 M 2
7 1234 M 4
8 1234 M 3
This is what I'd like
ID2 <- c("1234", "1242", "1456")
MEMBER_GENDER2 <- c("M", "missing", "F")
RELATIONSHIP_TYPE_CODE2 <- c("missing", "2", "3")
data2 <- data.frame(cbind(ID,MEMBER_GENDER, RELATIONSHIP_TYPE_CODE))
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234 M missing
2 1242 missing 2
3 1456 F 3
I can aggregate the data based on unique values, but then I end up with this
aggregate(.~ID, data=data, FUN=unique)
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234 M 1, 3, 4
2 1242 F, M 2
3 1456 F 3
I'm not sure how to fill with missing instead of a combination of the variables.
CodePudding user response:
Similar to Abigail,
library(dplyr)
data %>%
group_by(ID) %>%
summarise(
MEMBER_GENDER = list(unique(MEMBER_GENDER)),
RELATIONSHIP_TYPE_CODE = list(unique(RELATIONSHIP_TYPE_CODE))) %>%
mutate(across(-ID, function(x) ifelse(lengths(x) == 1, as.character(x), "missing")))
Output is
# A tibble: 3 × 3
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
<chr> <chr> <chr>
1 1234 M missing
2 1242 missing 2
3 1456 F 3
CodePudding user response:
data2 <- data %>%
group_by(ID) %>%
mutate(MEMBER_GENDER = case_when(n_distinct(MEMBER_GENDER) > 1 ~ "missing",
n_distinct(MEMBER_GENDER) == 1 ~ MEMBER_GENDER)) %>%
mutate(RELATIONSHIP_TYPE_CODE = case_when(n_distinct(RELATIONSHIP_TYPE_CODE) > 1 ~ "missing",
n_distinct(RELATIONSHIP_TYPE_CODE) == 1 ~ RELATIONSHIP_TYPE_CODE)) %>%
ungroup() %>%
distinct()
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
<chr> <chr> <chr>
1 1234 M missing
2 1242 missing 2
3 1456 F 3
Also, if you have more than just a two of these, you can make a function and apply it across all of them at once:
missing <- function(x) {
case_when(n_distinct(x) > 1 ~ "missing",
n_distinct(x) == 1 ~ x)
}
data3 <- data %>%
group_by(ID) %>%
mutate(across(c(MEMBER_GENDER, RELATIONSHIP_TYPE_CODE), missing)) %>%
ungroup() %>%
distinct()
CodePudding user response:
You can also use the following solution:
library(dplyr)
data %>%
group_by(ID) %>%
summarise(across(everything(), ~ c((first(.x)), 'Missing')[(length(unique(.x)) > 1) 1]))
# A tibble: 3 × 3
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
<chr> <chr> <chr>
1 1234 M Missing
2 1242 Missing Missing
3 1456 F 3
I noticed the sample data and the printed one do not match. I used the following data sample:
ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234 M 1
2 1234 M 1
3 1242 F 2
4 1456 F 3
5 1234 M 3
6 1242 M 1
7 1234 M 4
8 1234 M 3