I am working with sorting a table in r and i am running into issues.
I have data similar to this
Sample.ID | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 |
---|---|---|---|---|---|
39 | GG | NA | CC | NA | GG |
39 | NA | CC | NA | NA | NA |
39 | NA | NA | NA | TT | NA |
40 | CC | NA | NA | CC | CG |
40 | NA | NA | TT | NA | NA |
40 | NA | GG | NA | NA | NA |
I am expecting something like this:
Sample.ID | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 |
---|---|---|---|---|---|
39 | GG | CC | CC | TT | GG |
40 | CC | GG | TT | CC | CG |
Any help is appreciated!!
CodePudding user response:
Note that these blindly return the first non-NA
value in a column for each group; if there are no non-NA
values then it will return NA
.
base R
aggregate(quux, quux["Sample.ID"], FUN = function(z) unique(na.omit(z)[1]))
# Group.1 Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
# 1 39 39 GG CC CC TT GG
# 2 40 40 CC GG TT CC CG
dplyr
library(dplyr)
quux %>%
group_by(Sample.ID) %>%
summarize(across(everything(), ~ unique(na.omit(.))[1]))
# # A tibble: 2 x 6
# Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 39 GG CC CC TT GG
# 2 40 CC GG TT CC CG
data.table
library(data.table)
as.data.table(quux)[, lapply(.SD, function(z) unique(na.omit(z)[1])), by = Sample.ID]
# Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
# <int> <char> <char> <char> <char> <char>
# 1: 39 GG CC CC TT GG
# 2: 40 CC GG TT CC CG
Data
quux <- structure(list(Sample.ID = c(39L, 39L, 39L, 40L, 40L, 40L), SNP1 = c("GG", NA, NA, "CC", NA, NA), SNP2 = c(NA, "CC", NA, NA, NA, "GG"), SNP3 = c("CC", NA, NA, NA, "TT", NA), SNP4 = c(NA, NA, "TT", "CC", NA, NA), SNP5 = c("GG", NA, NA, "CG", NA, NA)), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
Another potential solution is to 'fill in' the NAs with the value above/below then remove the duplicate rows, i.e.
library(tidyverse)
df <- read.table(text = "Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
39 GG NA CC NA GG
39 NA CC NA NA NA
39 NA NA NA TT NA
40 CC NA NA CC CG
40 NA NA TT NA NA
40 NA GG NA NA NA", header = TRUE)
df %>%
group_by(Sample.ID) %>%
fill(everything(), .direction = 'downup') %>%
distinct()
#> # A tibble: 2 × 6
#> # Groups: Sample.ID [2]
#> Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 39 GG CC CC TT GG
#> 2 40 CC GG TT CC CG
The potential advantage of this approach is that it's easy to detect when/where you have multiple values for the same group and column, e.g. with sample 39 having two different values for SNP5, you get two rows in the output:
library(tidyverse)
df <- read.table(text = "Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
39 GG NA CC NA GG
39 NA CC NA NA CC
39 NA NA NA TT NA
40 CC NA NA CC CG
40 NA NA TT NA NA
40 NA GG NA NA NA", header = TRUE)
df %>%
group_by(Sample.ID) %>%
fill(everything(), .direction = 'downup') %>%
distinct()
#> # A tibble: 3 × 6
#> # Groups: Sample.ID [2]
#> Sample.ID SNP1 SNP2 SNP3 SNP4 SNP5
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 39 GG CC CC TT GG
#> 2 39 GG CC CC TT CC
#> 3 40 CC GG TT CC CG
Created on 2022-12-01 with reprex v2.0.2