How to sort and subset data in R

I am trying to select rows based on the first letters of the cells in one column, and I am having a hard time writing the code in R.

snp	allele1	allele2
mmsoop	A	A
rs3122	C	G
SNP1234	T	C
rs3144	A	A

The above is an example that shows what my dataset looks like. I want to subset the whole table, keeping only the rows where the snp column starts with "rs" or "SNP".

Expected table:

snp	allele1	allele2
rs3122	C	G
SNP1234	T	C
rs3144	A	A

Any help is appreciated!!

CodePudding user response：

Alternatively, index the rows with grep, which returns the positions of the values that match the pattern:

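# Read the example data; the first non-blank line of the text holds the column names
df <- read.table(
  text = "
snp allele1 allele2
mmsoop  A   A
rs3122  C   G
SNP1234 T   C
rs3144  A   A",
  header = TRUE
)

# grep() returns the indices of the snp values that start with SNP or rs
df[grep("^(SNP|rs)", df$snp), ]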

     snp allele1 allele2
2  rs3122       C       G
3 SNP1234       T       C
4  rs3144       A       A
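
If a regular expression feels like more than the task needs, base R's startsWith() (available since R 3.3.0) tests literal prefixes directly. The following is only a sketch of that alternative: it rebuilds the example data under the hypothetical name snp_df and is not the approach used in the answer above.

```r
# Hypothetical copy of the example data from the question
snp_df <- data.frame(
  snp     = c("mmsoop", "rs3122", "SNP1234", "rs3144"),
  allele1 = c("A", "C", "T", "A"),
  allele2 = c("A", "G", "C", "A"),
  stringsAsFactors = FALSE  # startsWith() needs character input, not factors
)

# TRUE for every snp that begins with the literal prefix "rs" or "SNP"
keep <- startsWith(snp_df$snp, "rs") | startsWith(snp_df$snp, "SNP")

# Keep only the matching rows; all columns are retained
snp_subset <- snp_df[keep, ]
print(snp_subset)
```

On R versions before 4.0, read.table returns character columns as factors by default, so the column would need as.character() before it is passed to startsWith().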

CodePudding user response：

We can use grepl inside subset. grepl builds a logical vector that is TRUE where snp starts (^) with rs or (|) SNP, and subset keeps the rows where that vector is TRUE.

subset(df1, grepl("^(rs|SNP)", snp))
      snp allele1 allele2
2  rs3122       C       G
3 SNP1234       T       C
4  rs3144       A       A
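
To see what subset receives, it can help to print the intermediate result. A small sketch, using a hypothetical character vector ids that mirrors the snp column, compares grepl (one TRUE/FALSE per element) with grep (positions of the matches):

```r
# Hypothetical vector mirroring the snp column of the example data
ids <- c("mmsoop", "rs3122", "SNP1234", "rs3144")

# grepl() returns one TRUE/FALSE per element
is_match <- grepl("^(rs|SNP)", ids)
print(is_match)
# [1] FALSE  TRUE  TRUE  TRUE

# grep() returns the positions of the matching elements instead
match_pos <- grep("^(rs|SNP)", ids)
print(match_pos)
# [1] 2 3 4
```

Either result can be used as a row index, which is why the grep and grepl answers return the same rows.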

data

df1 <- structure(list(snp = c("mmsoop", "rs3122", "SNP1234", "rs3144"
), allele1 = c("A", "C", "T", "A"), allele2 = c("A", "G", "C", 
"A")), class = "data.frame", row.names = c(NA, -4L))

CodePudding user response：

We could combine filter with str_detect:

library(dplyr)
library(stringr)

df %>% 
  filter(str_detect(snp, '^(rs|SNP)'))  # ^ anchors the match to the start of snp

      snp allele1 allele2
1  rs3122       C       G
2 SNP1234       T       C
3  rs3144       A       A
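
The ^ anchor and the parentheses matter once an identifier contains rs or SNP somewhere other than at the start. The sketch below uses made-up identifiers and base R's grepl so that it runs without extra packages; for a pattern this simple, str_detect is expected to behave the same way.

```r
# Made-up identifiers: two contain "rs"/"SNP" but do not start with them
ids <- c("rs3122", "users1", "mySNP9", "SNP1234")

# Unanchored: matches "rs" or "SNP" anywhere in the string
anywhere <- grepl("rs|SNP", ids)

# Anchored and grouped: matches only at the start of the string
at_start <- grepl("^(rs|SNP)", ids)

# Without parentheses, ^ applies to "rs" only; "SNP" may still match anywhere
half_anchored <- grepl("^rs|SNP", ids)

print(ids[anywhere])
# [1] "rs3122"  "users1"  "mySNP9"  "SNP1234"
print(ids[at_start])
# [1] "rs3122"  "SNP1234"
```

With the four rows from the question, all three patterns happen to select the same rows, so the difference only shows up on data like the made-up identifiers here.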