Merge values based on duplicates in a column


Let's say I have the following data frame:

hugo <- c("bnv", "cdv", "gcd", "efd", "efd")
sample <- c("1", "2", "3", "2", "4")
df <- data.frame(hugo, sample)
df

  hugo sample
1  bnv      1
2  cdv      2
3  gcd      3
4  efd      2
5  efd      4

I want to get rid of the duplicate sample numbers and merge the corresponding hugo values, like this:

     hugo2 sample2
1      bnv       1
2 cdv, efd       2
3      gcd       3
4      efd       4

Is there a way to do this?

CodePudding user response:

Assuming the data frame is stored as df, either use toString in aggregate,

(a1 <- aggregate(hugo ~ sample, df, toString))
#   sample     hugo
# 1      1      bnv
# 2      2 cdv, efd
# 3      3      gcd
# 4      4      efd


str(a1)
# 'data.frame': 4 obs. of  2 variables:
# $ sample: chr  "1" "2" "3" "4"
# $ hugo  : chr  "bnv" "cdv, efd" "gcd" "efd"

Or using list,

(a2 <- aggregate(hugo ~ sample, df, list))
#   sample     hugo
# 1      1      bnv
# 2      2 cdv, efd
# 3      3      gcd
# 4      4      efd

which prints the same, but hugo is now a list column:

str(a2)
# 'data.frame': 4 obs. of  2 variables:
# $ sample: chr  "1" "2" "3" "4"
# $ hugo  :List of 4
#  ..$ : chr "bnv"
#  ..$ : chr  "cdv" "efd"
#  ..$ : chr "gcd"
#  ..$ : chr "efd"

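Which one to use depends on whether you need a single string per sample (toString) or the individual values kept as separate elements (list).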
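
To make that difference concrete, here is a minimal sketch of how the two results behave when you work with them afterwards. It rebuilds the question's data under the assumed name df (the question never assigns the data frame) and sets stringsAsFactors = FALSE so that it behaves the same on older R versions.

```r
# Rebuild the question's data; the name `df` is an assumption.
df <- data.frame(
  hugo   = c("bnv", "cdv", "gcd", "efd", "efd"),
  sample = c("1", "2", "3", "2", "4"),
  stringsAsFactors = FALSE
)

a1 <- aggregate(hugo ~ sample, df, toString)  # one string per sample
a2 <- aggregate(hugo ~ sample, df, list)      # one character vector per sample

# With toString, the values for sample 2 are fused into a single string ...
a1$hugo[a1$sample == "2"]

# ... and have to be split again to get them back.
strsplit(a1$hugo[a1$sample == "2"], ", ", fixed = TRUE)[[1]]

# With list, the individual values are still available directly.
a2$hugo[[which(a2$sample == "2")]]

# Number of hugo values merged into each sample.
lengths(a2$hugo)
```

The toString result is easier to print or write to a CSV file, while the list column is easier to process further.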

CodePudding user response:

You could use dplyr and summarize together with paste0 to achieve this:

library(dplyr)

hugo <- c("bnv", "cdv", "gcd", "efd", "efd")
sample <- c("1", "2", "3", "2", "4")
df1 <- data.frame(hugo, sample)

# Collapse all hugo values that share a sample into one string
df1 %>%
  group_by(sample) %>%
  summarize(hugo = paste0(hugo, collapse = ", "))
#> # A tibble: 4 × 2
#>   sample hugo   
#>   <fct>  <chr>  
#> 1 1      bnv    
#> 2 2      cdv, efd
#> 3 3      gcd    
#> 4 4      efd

CodePudding user response:

You can use toString() by group with dplyr, renaming the columns in the same step:

library(dplyr)

group_by(df, sample2 = sample) %>% summarize(hugo2 = toString(hugo))


# A tibble: 4 × 2
  sample2 hugo2   
  <chr>   <chr>   
1 1       bnv     
2 2       cdv, efd
3 3       gcd     
4 4       efd     
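
The paste0(..., collapse = ", ") and toString() approaches should give the same strings, since toString() on a character vector collapses it with ", " as the separator. A small base R check of that claim, using only the question's values (the names groups, via_paste and via_toString are made up for this sketch):

```r
hugo <- c("bnv", "cdv", "gcd", "efd", "efd")
sample <- c("1", "2", "3", "2", "4")

# Group the hugo values by sample, then collapse each group both ways.
groups <- split(hugo, sample)
via_paste    <- vapply(groups, paste, character(1), collapse = ", ")
via_toString <- vapply(groups, toString, character(1))

# Both approaches should produce identical named character vectors.
identical(via_paste, via_toString)
```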