Home > database >  I need to convert the table columns into rows in r
I need to convert the table columns into rows in r

Time:12-01

I am working with sorting a table in r and i am running into issues.

I have data similar to this

Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
39 GG NA CC NA GG
39 NA CC NA NA NA
39 NA NA NA TT NA
40 CC NA NA CC CG
40 NA NA TT NA NA
40 NA GG NA NA NA

I am expecting something like this:

Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
39 GG CC CC TT GG
40 CC GG TT CC CG

Any help is appreciated!!

CodePudding user response:

Note that these blindly return the first non-NA value in a column for each group; if there are no non-NA values then it will return NA.

base R

aggregate(quux, quux["Sample.ID"], FUN = function(z) unique(na.omit(z)[1]))
#   Group.1 Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
# 1      39        39   GG   CC   CC   TT   GG
# 2      40        40   CC   GG   TT   CC   CG

dplyr

library(dplyr)
quux %>%
  group_by(Sample.ID) %>%
  summarize(across(everything(), ~ unique(na.omit(.))[1]))
# # A tibble: 2 x 6
#   Sample.ID SNP1  SNP2  SNP3  SNP4  SNP5 
#       <int> <chr> <chr> <chr> <chr> <chr>
# 1        39 GG    CC    CC    TT    GG   
# 2        40 CC    GG    TT    CC    CG   

data.table

library(data.table)
as.data.table(quux)[, lapply(.SD, function(z) unique(na.omit(z)[1])), by = Sample.ID]
#    Sample.ID   SNP1   SNP2   SNP3   SNP4   SNP5
#        <int> <char> <char> <char> <char> <char>
# 1:        39     GG     CC     CC     TT     GG
# 2:        40     CC     GG     TT     CC     CG

Data

quux <- structure(list(Sample.ID = c(39L, 39L, 39L, 40L, 40L, 40L), SNP1 = c("GG", NA, NA, "CC", NA, NA), SNP2 = c(NA, "CC", NA, NA, NA, "GG"), SNP3 = c("CC", NA, NA, NA, "TT", NA), SNP4 = c(NA, NA, "TT", "CC", NA, NA), SNP5 = c("GG", NA, NA, "CG", NA, NA)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

Another potential solution is to 'fill in' the NAs with the value above/below then remove the duplicate rows, i.e.

library(tidyverse)

df <- read.table(text = "Sample.ID  SNP1    SNP2    SNP3    SNP4    SNP5
39  GG  NA  CC  NA  GG
39  NA  CC  NA  NA  NA
39  NA  NA  NA  TT  NA
40  CC  NA  NA  CC  CG
40  NA  NA  TT  NA  NA
40  NA  GG  NA  NA  NA", header = TRUE)

df %>%
  group_by(Sample.ID) %>%
  fill(everything(), .direction = 'downup') %>%
  distinct()
#> # A tibble: 2 × 6
#> # Groups:   Sample.ID [2]
#>   Sample.ID SNP1  SNP2  SNP3  SNP4  SNP5 
#>       <int> <chr> <chr> <chr> <chr> <chr>
#> 1        39 GG    CC    CC    TT    GG   
#> 2        40 CC    GG    TT    CC    CG

The potential advantage of this approach is that it's easy to detect when/where you have multiple values for the same group and column, e.g. with sample 39 having two different values for SNP5, you get two rows in the output:

library(tidyverse)

df <- read.table(text = "Sample.ID  SNP1    SNP2    SNP3    SNP4    SNP5
39  GG  NA  CC  NA  GG
39  NA  CC  NA  NA  CC
39  NA  NA  NA  TT  NA
40  CC  NA  NA  CC  CG
40  NA  NA  TT  NA  NA
40  NA  GG  NA  NA  NA", header = TRUE)

df %>%
  group_by(Sample.ID) %>%
  fill(everything(), .direction = 'downup') %>%
  distinct()
#> # A tibble: 3 × 6
#> # Groups:   Sample.ID [2]
#>   Sample.ID SNP1  SNP2  SNP3  SNP4  SNP5 
#>       <int> <chr> <chr> <chr> <chr> <chr>
#> 1        39 GG    CC    CC    TT    GG   
#> 2        39 GG    CC    CC    TT    CC   
#> 3        40 CC    GG    TT    CC    CG

Created on 2022-12-01 with reprex v2.0.2

  •  Tags:  
  • r
  • Related