Home > Back-end >  Merging rows of dataframe based on unique ID in R
Merging rows of dataframe based on unique ID in R

Time:06-15

I’m new to R – be gentle.

I have a dataframe where a unique subject ID is repeated twice for all participants. The data following seems to consist of one column for which the value is NA one of the entries, and one column where there is a value for one of the entries (though this is not certain and the method I use should account for the possibility of this not being true). Here is an example:

Name <- c("Jon", "Jon", "Maria", "Maria", "Tina", "Tina", "dan", 'dan', 'wen', 'wen')
a <- c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)
b <- c(NA, 1, NA, 2, NA, 3, NA, 4, NA, 5)
c <- c(1, NA, 2, NA, 3, NA, 4, NA, 5, NA)
df <- data.frame(Name, a, b, c)

The solution I thought of so far consists of looping through all the unique IDs (in the above example, Names) and making separate dataframes for each of the entries. Something like this:

#Instantiate list of lists that will become dfs
firstdf <- c()
seconddf <- c()

#Loop through existing df by unique ID (Name) and create 
# list containing values of 1 entry and list of the other 
for (i in unique(df$Name)) {
  innerlist1 <- c()
  innerlist2 <- c()
  
  for (x in c(1:length(df[df['Name'] == i]))) {
    if (x%%2 == 1) {
      # Takes one set of entries per ID
      innerlist1 <- c(innerlist1, df[df['Name'] == i][x])
      
    } else if (x%%2 == 0) {
      # Takes other set of entries per ID
      innerlist2 <- c(innerlist2, df[df['Name'] == i][x])
    }
  }
  firstdf <- c(firstdf, list(innerlist1))
  seconddf <- c(seconddf, list(innerlist2))
}
# Make dfs from lists
firstdf <- do.call(rbind.data.frame, firstdf)
names(firstdf) <- names(df)

seconddf <- do.call(rbind.data.frame, seconddf)
names(seconddf) <- names(df)

I would then proceed to combine the dfs by using something like merge, with by="Name". My original dataset is large and this is not particularly efficient or elegant. Can anyone suggest improvements? Thank you.

CodePudding user response:

You can keep first value by group exluding NA:

library(dplyr)

df %>%
  group_by(Name) %>%
  summarise(a = first(stats::na.omit(a)), 
            b = first(stats::na.omit(b)), 
            c = first(stats::na.omit(c)))
# A tibble: 5 x 4
  Name      a     b     c
  <chr> <dbl> <dbl> <dbl>
1 dan       4     4     4
2 Jon       1     1     1
3 Maria     2     2     2
4 Tina      3     3     3
5 wen       5     5     5

CodePudding user response:

If there are multiple non-NA values per ID, you can concentrate them toString. You can use the following code:

library(dplyr)
df %>% 
  group_by(Name) %>% 
  summarise_all(funs(toString(na.omit(.))))

Output:

# A tibble: 5 × 4
  Name  a     b     c    
  <chr> <chr> <chr> <chr>
1 dan   4, 4  4     4    
2 Jon   1, 1  1     1    
3 Maria 2, 2  2     2    
4 Tina  3, 4  3     3    
5 wen   5, 6  5     5 
  • Related