I am trying to remove certain rows from my dataset based on their values.
I want to drop the rows whose value is "." and split the rows that contain several rsIDs so that each rsID ends up in its own row. I tried to do this with the filter function, but it gives me an error.
The command I used:
filter(rsid_en_vcf, X1 != ".")
Error in filter(rsid_en_vcf, X1 != ".") : object 'X1' not found
My dataset is:
dput(rsid_en_vcf[1:48, 1])
c("rs782629217", "rs782403204", "rs199529001", ".", "rs147880041",
".", ".", "rs141826009", "rs199826048", "rs200558688", "rs782114919",
"rs41304577", ".", "rs200311430", "rs147114528", "rs200635479",
"rs41288741", "rs782167952", "rs6560827", "rs200242637", "rs144539776",
"rs41305669", "rs41288743", "rs41288743", "rs369736529", "rs148025238",
"rs41298226", "rs782272071", "rs9329304", "rs9329305", "rs137895574",
"rs142619172", "rs144154384", "rs782777737", "rs782796368", "rs782443786",
"rs782246853", "rs150779790", "rs782304204", "rs9329306", "rs144740103",
"rs4431953", "rs189892388;rs75953774", "rs61839057", "rs61839058",
"rs145405488", "rs782307404", "rs782307404")
CodePudding user response:
In regular expressions, . matches any character. So if you use a bare . as the pattern in a filter statement, it matches every non-empty value rather than only the literal period. To search explicitly for a ., we need to escape it, either by putting it in a character class (i.e., [.]) or by escaping it with \\.
library(tidyverse)
# Drop every row whose code contains a literal period
df %>%
  filter(!str_detect(codes, "[.]"))
Or you can use \\:
df %>%
  filter(!str_detect(codes, "\\."))
Or in base R:
df[!grepl("\\.", df$codes),]
Or set fixed = TRUE:
df[!grepl(".", df$codes, fixed = TRUE), ]
Output
codes
1 rs782629217
2 rs782403204
3 rs199529001
4 rs147880041
5 rs141826009
6 rs199826048
7 rs200558688
8 rs782114919
9 rs41304577
10 rs200311430
11 rs147114528
12 rs200635479
13 rs41288741
14 rs782167952
15 rs6560827
16 rs200242637
17 rs144539776
18 rs41305669
19 rs41288743
20 rs41288743
21 rs369736529
22 rs148025238
23 rs41298226
24 rs782272071
25 rs9329304
26 rs9329305
27 rs137895574
28 rs142619172
29 rs144154384
30 rs782777737
31 rs782796368
32 rs782443786
33 rs782246853
34 rs150779790
35 rs782304204
36 rs9329306
37 rs144740103
38 rs4431953
39 rs189892388;rs75953774
40 rs61839057
41 rs61839058
42 rs145405488
43 rs782307404
44 rs782307404
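The output above still holds the combined entry rs189892388;rs75953774 in a single row, which is the second thing the question asks about. A minimal sketch of one way to split it, assuming the IDs are always separated by ";" and that tidyr is installed; the toy data frame is made up for illustration and only borrows the codes column name used here:

```r
library(dplyr)
library(tidyr)

# Illustrative data only; `codes` mirrors the column name used in this answer
toy <- data.frame(codes = c("rs4431953", ".", "rs189892388;rs75953774", "rs61839057"))

result <- toy %>%
  filter(codes != ".") %>%         # exact comparison, so no regex escaping is needed
  separate_rows(codes, sep = ";")  # one rsID per row

result
```

Because the placeholder rows are exactly ".", a plain != comparison is enough here; the regex versions are only needed if the period can appear inside a longer string.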
Data
df <- structure(list(codes = c("rs782629217", "rs782403204", "rs199529001",
".", "rs147880041", ".", ".", "rs141826009", "rs199826048", "rs200558688",
"rs782114919", "rs41304577", ".", "rs200311430", "rs147114528",
"rs200635479", "rs41288741", "rs782167952", "rs6560827", "rs200242637",
"rs144539776", "rs41305669", "rs41288743", "rs41288743", "rs369736529",
"rs148025238", "rs41298226", "rs782272071", "rs9329304", "rs9329305",
"rs137895574", "rs142619172", "rs144154384", "rs782777737", "rs782796368",
"rs782443786", "rs782246853", "rs150779790", "rs782304204", "rs9329306",
"rs144740103", "rs4431953", "rs189892388;rs75953774", "rs61839057",
"rs61839058", "rs145405488", "rs782307404", "rs782307404")), class = "data.frame", row.names = c(NA,
-48L))
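As for the original error, object 'X1' not found says that R could not find anything called X1, so the pattern was never the problem at that point. This is a guess from the error text alone: either the column has a different name, or filter() resolved to base R's stats::filter() because dplyr was not attached. A small sketch of how one might check both, using a hypothetical stand-in for rsid_en_vcf with an assumed column name rsid:

```r
# Hypothetical stand-in for rsid_en_vcf; the real column name is not known
rsid_toy <- data.frame(rsid = c("rs782629217", ".", "rs147880041"))

# Check the actual column names before filtering
names(rsid_toy)

# Exact comparison in base R, no regular expression involved
kept <- rsid_toy[rsid_toy$rsid != ".", , drop = FALSE]

# Same filter with dplyr, namespaced so stats::filter() cannot be picked up by mistake
kept_dplyr <- dplyr::filter(rsid_toy, rsid != ".")
```

Writing dplyr::filter() explicitly, or calling library(dplyr) first, removes the ambiguity between the two functions.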