remove or filter values based on row

Time:04-26

I am trying to remove certain rows from my dataset based on their values.

I want to remove the rows whose value is "." and to split the rows that contain several rsIDs so that each rsID ends up in its own row. I tried to do this with the filter function, but it gives me an error.

The command I used:

filter(rsid_en_vcf, X1 != ".")

Error in filter(rsid_en_vcf, X1 != ".") : object 'X1' not found

My dataset is:

dput(rsid_en_vcf[1:48, 1])
c("rs782629217", "rs782403204", "rs199529001", ".", "rs147880041", 
".", ".", "rs141826009", "rs199826048", "rs200558688", "rs782114919", 
"rs41304577", ".", "rs200311430", "rs147114528", "rs200635479", 
"rs41288741", "rs782167952", "rs6560827", "rs200242637", "rs144539776", 
"rs41305669", "rs41288743", "rs41288743", "rs369736529", "rs148025238", 
"rs41298226", "rs782272071", "rs9329304", "rs9329305", "rs137895574", 
"rs142619172", "rs144154384", "rs782777737", "rs782796368", "rs782443786", 
"rs782246853", "rs150779790", "rs782304204", "rs9329306", "rs144740103", 
"rs4431953", "rs189892388;rs75953774", "rs61839057", "rs61839058", 
"rs145405488", "rs782307404", "rs782307404")

CodePudding user response:

In regular expressions, . matches any single character. So if you use a bare . as the pattern, every non-empty string matches, and negating that match in a filter statement would drop every row instead of only the "." rows. To match a literal period, either put it in a character class (i.e., [.]) or escape it with \\.
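To make the difference concrete, here is a small base R sketch on a made-up three-element vector (x is an illustrative name, not part of the original data frame). It compares the bare pattern with the literal forms:

```r
# Illustrative vector: two rsIDs and one "." placeholder
x <- c("rs782629217", ".", "rs147880041")

any_char  <- grepl(".", x)                # "." as a regex: matches any character
char_cls  <- grepl("[.]", x)              # literal period via a character class
escaped   <- grepl("\\.", x)              # literal period via escaping
fixed_str <- grepl(".", x, fixed = TRUE)  # pattern treated as a plain string

print(any_char)   # TRUE TRUE TRUE
print(char_cls)   # FALSE TRUE FALSE
```

The escaped and fixed = TRUE forms give the same result as the character class here.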

library(tidyverse)

df %>% 
  filter(!str_detect(codes, "[.]"))

Or escape the period with \\:

df %>% 
  filter(!str_detect(codes, "\\."))

Or in base R:

df[!grepl("\\.", df$codes),]

Or set fixed = TRUE so the pattern is treated as a literal string:

df[!grepl(".", df$codes, fixed = TRUE), ]
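Since the unwanted rows consist of exactly ".", a plain equality comparison should also work. Unlike the patterns above, it would not drop a value that merely contains a period somewhere. This is a minimal sketch on a shortened copy of the data. The column name codes follows the Data section of this answer and may differ in the original rsid_en_vcf.

```r
# Shortened copy of the data; the column name "codes" is an assumption
df_small <- data.frame(
  codes = c("rs782629217", ".", "rs147880041", "."),
  stringsAsFactors = FALSE
)

# Keep only rows whose value is not exactly "."
# (drop = FALSE keeps the result a data frame even with a single column)
kept <- df_small[df_small$codes != ".", , drop = FALSE]
print(kept)
```

As for the error in the question, "object 'X1' not found" reported directly by filter() looks like what stats::filter produces when dplyr is not attached. That is only a guess about the asker's session. Calling dplyr::filter explicitly and checking names(rsid_en_vcf) for the real column name would be reasonable first checks.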

Output

                    codes
1             rs782629217
2             rs782403204
3             rs199529001
4             rs147880041
5             rs141826009
6             rs199826048
7             rs200558688
8             rs782114919
9              rs41304577
10            rs200311430
11            rs147114528
12            rs200635479
13             rs41288741
14            rs782167952
15              rs6560827
16            rs200242637
17            rs144539776
18             rs41305669
19             rs41288743
20             rs41288743
21            rs369736529
22            rs148025238
23             rs41298226
24            rs782272071
25              rs9329304
26              rs9329305
27            rs137895574
28            rs142619172
29            rs144154384
30            rs782777737
31            rs782796368
32            rs782443786
33            rs782246853
34            rs150779790
35            rs782304204
36              rs9329306
37            rs144740103
38              rs4431953
39 rs189892388;rs75953774
40             rs61839057
41             rs61839058
42            rs145405488
43            rs782307404
44            rs782307404
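Row 39 in the output still holds two rsIDs joined by ";", which is the second part of the question. One way to split such values into individual rows in base R is sketched below, on a shortened copy of the data. The id column is hypothetical and is there only to show that other columns are carried along.

```r
# Shortened copy of the data; "id" is a hypothetical extra column
df_small <- data.frame(
  id    = 1:4,
  codes = c("rs4431953", "rs189892388;rs75953774", ".", "rs61839057"),
  stringsAsFactors = FALSE
)

# Step 1: drop the "." placeholder rows
kept <- df_small[df_small$codes != ".", , drop = FALSE]

# Step 2: split each value on ";" (one character vector per row)
parts <- strsplit(kept$codes, ";", fixed = TRUE)

# Step 3: repeat each row once per rsID, then fill in the split values
long <- kept[rep(seq_len(nrow(kept)), lengths(parts)), , drop = FALSE]
long$codes <- unlist(parts)
rownames(long) <- NULL

print(long)
```

With the tidyverse loaded, tidyr::separate_rows(df, codes, sep = ";") should give the same shape in a single call.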

Data

df <- structure(list(codes = c("rs782629217", "rs782403204", "rs199529001", 
".", "rs147880041", ".", ".", "rs141826009", "rs199826048", "rs200558688", 
"rs782114919", "rs41304577", ".", "rs200311430", "rs147114528", 
"rs200635479", "rs41288741", "rs782167952", "rs6560827", "rs200242637", 
"rs144539776", "rs41305669", "rs41288743", "rs41288743", "rs369736529", 
"rs148025238", "rs41298226", "rs782272071", "rs9329304", "rs9329305", 
"rs137895574", "rs142619172", "rs144154384", "rs782777737", "rs782796368", 
"rs782443786", "rs782246853", "rs150779790", "rs782304204", "rs9329306", 
"rs144740103", "rs4431953", "rs189892388;rs75953774", "rs61839057", 
"rs61839058", "rs145405488", "rs782307404", "rs782307404")), class = "data.frame", row.names = c(NA, 
-48L))