Home > front end >  Deduplicate and lengthen dataframe in R
Deduplicate and lengthen dataframe in R

Time:11-13

After merging three datasets I've got a mess. There is a unique id field and then there can be one or more samples associated with each id. So far I've got

samples <- structure(list(id = c(1029459, 1029459, 1029459, 1029459, 1030272, 
1030272, 1030272, 1032157, 1032157, 1032178, 1032178, 1032219, 
1032219, 1032229, 1032229, 1032494, 1032494, 1032780, 1032780
), sample1 = c(853401, 853401, 853401, 853401, 852769, 852769, 
852769, 850161, 850161, 852711, 852711, 852597, 852597, 850363, 
850363, 850717, 850717, 848763, 848763), sample2 = c(853401, 
853693, 853667, 853667, 852769, 853597, 853597, NA, NA, 852711, 
853419, 852597, 852597, 850363, 852741, 850717, 851811, 848763, 
848763), sample3 = c(NA, NA, NA, NA, NA, NA, NA, 853621, 852621, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -19L))
print(samples)
#> # A tibble: 19 × 4
#>         id sample1 sample2 sample3
#>      <dbl>   <dbl>   <dbl>   <dbl>
#>  1 1029459  853401  853401      NA
#>  2 1029459  853401  853693      NA
#>  3 1029459  853401  853667      NA
#>  4 1029459  853401  853667      NA
#>  5 1030272  852769  852769      NA
#>  6 1030272  852769  853597      NA
#>  7 1030272  852769  853597      NA
#>  8 1032157  850161      NA  853621
#>  9 1032157  850161      NA  852621
#> 10 1032178  852711  852711      NA
#> 11 1032178  852711  853419      NA
#> 12 1032219  852597  852597      NA
#> 13 1032219  852597  852597      NA
#> 14 1032229  850363  850363      NA
#> 15 1032229  850363  852741      NA
#> 16 1032494  850717  850717      NA
#> 17 1032494  850717  851811      NA
#> 18 1032780  848763  848763      NA
#> 19 1032780  848763  848763      NA

I'd like to get it so that all unique samples per id are combined into one sample column with a long dataframe. eg

id       sample
1029459  853401
1029459  853693
1030272  852769
1030272  853597
1032157  850161
1032157  853621

Any ideas?

CodePudding user response:

Is this what you are looking for? For example: id 1029459 has the unique samples 853401 , 853693 , 853667.

samples %>% 
  pivot_longer(
    c(sample1, sample2, sample3)
   ,names_to = "sample")  %>% 
  count(id, sample, value) %>% 
  drop_na() %>% 
  distinct(id, value) %>% 
  rename(sample = value)

# A tibble: 16 × 2
        id sample
     <dbl>  <dbl>
 1 1029459 853401
 2 1029459 853667
 3 1029459 853693
 4 1030272 852769
 5 1030272 853597
 6 1032157 850161
 7 1032157 852621
 8 1032157 853621
 9 1032178 852711
10 1032178 853419
11 1032219 852597
12 1032229 850363
13 1032229 852741
14 1032494 850717
15 1032494 851811
16 1032780 848763

CodePudding user response:

You can do it this way:

library(data.table)
unique(melt(setDT(samples), "id",value.name = "sample")[!is.na(sample),c(1,3)])

Output:

         id sample
 1: 1029459 853401
 2: 1030272 852769
 3: 1032157 850161
 4: 1032178 852711
 5: 1032219 852597
 6: 1032229 850363
 7: 1032494 850717
 8: 1032780 848763
 9: 1029459 853693
10: 1029459 853667
11: 1030272 853597
12: 1032178 853419
13: 1032229 852741
14: 1032494 851811
15: 1032157 853621
16: 1032157 852621
  • Related