Home > OS >  R: Aggregating data by ID, while replacing discrepancies in remaining columns with "missing&quo
R: Aggregating data by ID, while replacing discrepancies in remaining columns with "missing&quo

Time:09-07

I have data where I have multiple observations for each ID. I would like to aggregate/collapse the data by ID, however, some of the additional columns have discrepancies (the same ID could be male for one observation and female for another). If all observations agree, I'd like to keep the label, if they're are discrepancies, I'd like to replace it with "missing".

This is what I have

ID <- c("1234", "1234", "1242", "1456", "1234", "1242", "1234", "1234")
MEMBER_GENDER <- c("M", "M","F", "F", "M", "M", "M", "M")
RELATIONSHIP_TYPE_CODE <- c("1","1", "2", "3", "3", "2", "4" ,"3")
data <- data.frame(cbind(ID,MEMBER_GENDER, RELATIONSHIP_TYPE_CODE))

    ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234             M                      1
2 1234             M                      1
3 1242             F                      2
4 1456             F                      3
5 1234             M                      3
6 1242             M                      2
7 1234             M                      4
8 1234             M                      3

This is what I'd like

ID2 <- c("1234", "1242", "1456")
MEMBER_GENDER2 <- c("M", "missing", "F")
RELATIONSHIP_TYPE_CODE2 <- c("missing", "2", "3")
data2 <- data.frame(cbind(ID,MEMBER_GENDER, RELATIONSHIP_TYPE_CODE))

ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234             M                missing
2 1242       missing                      2
3 1456             F                      3

I can aggregate the data based on unique values, but then I end up with this

 aggregate(.~ID, data=data, FUN=unique)



 ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234             M                1, 3, 4
2 1242          F, M                      2
3 1456             F                      3

I'm not sure how to fill with missing instead of a combination of the variables.

CodePudding user response:

Similar to Abigail,

library(dplyr)
data %>% 
  group_by(ID) %>% 
  summarise(
    MEMBER_GENDER = list(unique(MEMBER_GENDER)), 
    RELATIONSHIP_TYPE_CODE = list(unique(RELATIONSHIP_TYPE_CODE))) %>% 
  mutate(across(-ID, function(x) ifelse(lengths(x) == 1, as.character(x), "missing")))

Output is

# A tibble: 3 × 3
  ID    MEMBER_GENDER RELATIONSHIP_TYPE_CODE
  <chr> <chr>         <chr>                 
1 1234  M             missing               
2 1242  missing       2                     
3 1456  F             3       

CodePudding user response:

data2 <- data %>% 
  group_by(ID) %>% 
  mutate(MEMBER_GENDER = case_when(n_distinct(MEMBER_GENDER) > 1 ~ "missing",
                                   n_distinct(MEMBER_GENDER) == 1 ~ MEMBER_GENDER)) %>% 
  mutate(RELATIONSHIP_TYPE_CODE = case_when(n_distinct(RELATIONSHIP_TYPE_CODE) > 1 ~ "missing",
                                   n_distinct(RELATIONSHIP_TYPE_CODE) == 1 ~ RELATIONSHIP_TYPE_CODE)) %>% 
  ungroup() %>% 
  distinct()

  ID    MEMBER_GENDER RELATIONSHIP_TYPE_CODE
  <chr> <chr>         <chr>                 
1 1234  M             missing               
2 1242  missing       2                     
3 1456  F             3   

Also, if you have more than just a two of these, you can make a function and apply it across all of them at once:

missing <- function(x) {
  case_when(n_distinct(x) > 1 ~ "missing",
            n_distinct(x) == 1 ~ x)
}

data3 <- data %>% 
  group_by(ID) %>% 
  mutate(across(c(MEMBER_GENDER, RELATIONSHIP_TYPE_CODE), missing)) %>% 
  ungroup() %>% 
  distinct()

CodePudding user response:

You can also use the following solution:

library(dplyr)

data %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ c((first(.x)), 'Missing')[(length(unique(.x)) > 1)   1]))


# A tibble: 3 × 3
  ID    MEMBER_GENDER RELATIONSHIP_TYPE_CODE
  <chr> <chr>         <chr>                 
1 1234  M             Missing               
2 1242  Missing       Missing               
3 1456  F             3  

I noticed the sample data and the printed one do not match. I used the following data sample:

    ID MEMBER_GENDER RELATIONSHIP_TYPE_CODE
1 1234             M                      1
2 1234             M                      1
3 1242             F                      2
4 1456             F                      3
5 1234             M                      3
6 1242             M                      1
7 1234             M                      4
8 1234             M                      3
  • Related