I have a dummy gene expression data frame that I want to filter.
**FRESHUPDATE**
The data frame below is a small subset of my bigger gene expression data set.
The subset is called mat2:
Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1 A2ML1 4.627857365 5.369632 6.700112904 5.6232636 4.75680637 5.8050996 6.2077827 5.2683007 5.232384
2 A4GALT 5.550918500 5.572321 4.569849528 6.2627817 5.25197103 6.4728585 3.8088796 5.5766959 6.458113
3 AACSP1 -0.004394347 1.195122 -0.004562859 0.1343311 -0.01469569 0.2808245 0.2881929 0.3270398 0.708931
4 ABCA9 5.652068819 5.579944 7.787378888 4.9460252 4.77917651 5.5384349 5.6242293 5.8726373 8.846332
5 ABCA9-AS1 0.557163318 1.701202 3.343076301 0.4203761 1.04232725 0.5324808 1.3794852 1.9304208 3.594210
6 ABCC13 4.077316070 8.840604 2.340835263 3.0782108 2.32162741 4.0645558 3.3683787 4.0456838 3.129047
7 ABLIM1 9.696391499 11.988791 9.873324476 10.5111442 10.81262360 9.0651002 10.6804131 9.4307673 11.879929
8 ABLIM3 5.292492658 5.979259 3.623770183 3.5016803 6.74841153 4.9092703 3.7786797 3.9352033 4.406261
9 ABO 10.631505004 6.859666 5.505456740 10.1379316 6.39110235 10.2743712 9.9307084 6.3601978 11.161422
10 ACOT12 1.648498344 3.762861 2.098422076 1.1439361 2.39612635 2.0490598 0.8765957 2.6902788 2.896370
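To flag outliers by standard deviation, I replaced every value that lies more than 2 standard deviations from its column mean with NA: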
mat1 <- mat2
mat1[,-1] <- lapply(mat1[,-1],
function(x) replace(x,abs(scale(x))>2,NA))
Then, to find the rows with any NA, I did this:
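library(dplyr)   # provides the %>% pipe
library(tibble)  # provides remove_rownames() and column_to_rownames()
mat_rown <- mat1 %>% remove_rownames %>% column_to_rownames(var="Symbol")
which(is.na(mat_rown),arr.ind = TRUE)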
The data frame with the NA values looks like this:
Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1 A2ML1 4.627857 5.369632 6.700113 5.623264 4.756806 5.805100 6.207783 5.268301 5.232384
2 A4GALT 5.550918 5.572321 4.569850 6.262782 5.251971 6.472859 3.808880 5.576696 6.458113
3 AACSP1 NA NA NA NA NA NA NA NA NA
4 ABCA9 5.652069 5.579944 NA 4.946025 4.779177 5.538435 5.624229 5.872637 8.846332
5 ABCA9-AS1 NA NA 3.343076 NA NA NA 1.379485 NA 3.594210
6 ABCC13 4.077316 8.840604 2.340835 3.078211 2.321627 4.064556 3.368379 4.045684 3.129047
7 ABLIM1 NA NA NA NA NA NA NA NA NA
8 ABLIM3 5.292493 5.979259 3.623770 3.501680 NA 4.909270 3.778680 3.935203 4.406261
9 ABO NA 6.859666 5.505457 NA NA NA NA 6.360198 NA
10 ACOT12 1.648498 3.762861 2.098422 1.143936 2.396126 2.049060 NA 2.690279 2.896370
Here we can see that several genes have NA in one or more columns,
so my objective is to pull out those rows.
When I try to index the rows with NA, for example
AACSP1, ABCA9-AS1, ABLIM1, ABO, ACOT12,
I get this:
row col
AACSP1 3 1
ABCA9-AS1 5 1
ABLIM1 7 1
ABO 9 1
AACSP1 3 2
ABCA9-AS1 5 2
ABLIM1 7 2
AACSP1 3 3
ABCA9 4 3
ABLIM1 7 3
AACSP1 3 4
ABCA9-AS1 5 4
ABLIM1 7 4
ABO 9 4
AACSP1 3 5
ABCA9-AS1 5 5
ABLIM1 7 5
ABLIM3 8 5
ABO 9 5
AACSP1 3 6
ABCA9-AS1 5 6
ABLIM1 7 6
ABO 9 6
AACSP1 3 7
ABLIM1 7 7
ABO 9 7
ACOT12 10 7
AACSP1 3 8
ABCA9-AS1 5 8
ABLIM1 7 8
AACSP1 3 9
ABLIM1 7 9
ABO 9 9
So my idea is simply to keep the rows (genes) that contain NA
in a separate data frame or object that I can use downstream in my other analyses.
CodePudding user response:
If you simply are trying to split your original frame into those that have a low vs high row standard deviation, you can do this:
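library(dplyr)   # provides the %>% pipe
library(tibble)  # provides rownames_to_column()
# move the row names (gene symbols) into a regular column called 'gene'
rld2 <- as.data.frame(mat) %>% rownames_to_column('gene')
# set your threshold that defines "high" deviation (I've picked a relatively low one here; you might choose something like 3)
sd_threshold <- 0.6
# get the row-specific standard deviation, using `apply()` over all columns except 'gene'
row_sds <- apply(rld2[,-1], 1, sd)
# split into a list of two frames: `FALSE` (low deviation) and `TRUE` (high deviation)
low_high_split <- split(rld2, f = row_sds > sd_threshold)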
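If the goal is instead to keep the NA-containing rows from the question in their own object, the same `split()` idea can be applied to a per-row NA flag. This is only a minimal sketch: the small `mat1` below is a made-up stand-in for the masked data frame in the question, and the names `has_na`, `na_split`, `genes_with_na` and `genes_without_na` are hypothetical.

```r
# Toy stand-in for the masked frame `mat1` from the question (values are made up)
mat1 <- data.frame(
  Symbol = c("A2ML1", "AACSP1", "ABCA9", "ABCC13"),
  s1 = c(4.6, NA, 5.7, 4.1),
  s2 = c(5.4, NA, 5.6, 8.8),
  s3 = c(6.7, NA, NA, 2.3),
  stringsAsFactors = FALSE
)

# TRUE for every row that holds at least one NA in the sample columns
has_na <- rowSums(is.na(mat1[, -1])) > 0

# split() on a logical vector returns a list with elements named "FALSE" and "TRUE"
na_split <- split(mat1, f = has_na)

genes_with_na    <- na_split[["TRUE"]]   # rows to keep for downstream checks
genes_without_na <- na_split[["FALSE"]]  # rows with no NA at all

genes_with_na$Symbol
```

The elements of `low_high_split` can be pulled out the same way, with `low_high_split[["TRUE"]]` for the high-deviation rows and `low_high_split[["FALSE"]]` for the rest.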