Filter NA containing rows into a new data frame in R

Time:10-02

My dummy data frame for gene expression is this

**FRESHUPDATE** 

This is a small data frame taken from my larger gene expression dataset, which I want to filter.

The small subset is mat2:

 Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1      A2ML1  4.627857365     5.369632  6.700112904    5.6232636   4.75680637    5.8050996    6.2077827    5.2683007     5.232384
2     A4GALT  5.550918500     5.572321  4.569849528    6.2627817   5.25197103    6.4728585    3.8088796    5.5766959     6.458113
3     AACSP1 -0.004394347     1.195122 -0.004562859    0.1343311  -0.01469569    0.2808245    0.2881929    0.3270398     0.708931
4      ABCA9  5.652068819     5.579944  7.787378888    4.9460252   4.77917651    5.5384349    5.6242293    5.8726373     8.846332
5  ABCA9-AS1  0.557163318     1.701202  3.343076301    0.4203761   1.04232725    0.5324808    1.3794852    1.9304208     3.594210
6     ABCC13  4.077316070     8.840604  2.340835263    3.0782108   2.32162741    4.0645558    3.3683787    4.0456838     3.129047
7     ABLIM1  9.696391499    11.988791  9.873324476   10.5111442  10.81262360    9.0651002   10.6804131    9.4307673    11.879929
8     ABLIM3  5.292492658     5.979259  3.623770183    3.5016803   6.74841153    4.9092703    3.7786797    3.9352033     4.406261
9        ABO 10.631505004     6.859666  5.505456740   10.1379316   6.39110235   10.2743712    9.9307084    6.3601978    11.161422
10    ACOT12  1.648498344     3.762861  2.098422076    1.1439361   2.39612635    2.0490598    0.8765957    2.6902788     2.896370

To flag values more than 2 standard deviations from the column mean, I replaced them with NA like this:

mat1 <- mat2

mat1[,-1] <- lapply(mat1[,-1],
                  function(x) replace(x,abs(scale(x))>2,NA))
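
As a rough illustration of what this replacement does, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not the data above). Because `lapply()` walks over the columns, `scale()` computes each z-score within one sample column, across genes, so a value becomes NA when it is extreme relative to the other genes in that sample.

```r
# Hypothetical toy data: 8 genes, 2 samples (values invented for illustration)
toy <- data.frame(
  Symbol = paste0("g", 1:8),
  s1 = c(1, 1.1, 0.9, 1, 1.2, 0.8, 1, 10),    # g8 is far from the rest of s1
  s2 = c(2, 2.1, 1.9, 2, 2.2, 1.8, 2, 2.05),  # no extreme value in s2
  stringsAsFactors = FALSE
)

flagged <- toy
# Same replacement as in the question: z-score each sample column,
# then blank out entries whose absolute z-score exceeds 2
flagged[, -1] <- lapply(flagged[, -1],
                        function(x) replace(x, abs(scale(x)) > 2, NA))

# Only the s1 value of g8 is replaced with NA
print(flagged)
```

If the intent is instead to compare each gene against its own values across samples, the scaling would have to run over rows rather than columns.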

Then, to find rows with any NA, I did this:

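library(tibble)    # remove_rownames(), column_to_rownames()
library(magrittr)  # %>%

# move the Symbol column into the row names so only numeric columns remain
mat_rown <- mat1 %>% remove_rownames %>% column_to_rownames(var="Symbol")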

which(is.na(mat_rown),arr.ind = TRUE)

After the replacement, mat1 looks like this:

 Symbol TCGA.AB.2856 TCGA.AB.2849 TCGA.AB.2971 TCGA.AB.2930 TCGA.AB.2891 TCGA.AB.2872 TCGA.AB.2851 TCGA.AB.3011 TCGA.AB.2949
1      A2ML1     4.627857     5.369632     6.700113     5.623264     4.756806     5.805100     6.207783     5.268301     5.232384
2     A4GALT     5.550918     5.572321     4.569850     6.262782     5.251971     6.472859     3.808880     5.576696     6.458113
3     AACSP1           NA           NA           NA           NA           NA           NA           NA           NA           NA
4      ABCA9     5.652069     5.579944           NA     4.946025     4.779177     5.538435     5.624229     5.872637     8.846332
5  ABCA9-AS1           NA           NA     3.343076           NA           NA           NA     1.379485           NA     3.594210
6     ABCC13     4.077316     8.840604     2.340835     3.078211     2.321627     4.064556     3.368379     4.045684     3.129047
7     ABLIM1           NA           NA           NA           NA           NA           NA           NA           NA           NA
8     ABLIM3     5.292493     5.979259     3.623770     3.501680           NA     4.909270     3.778680     3.935203     4.406261
9        ABO           NA     6.859666     5.505457           NA           NA           NA           NA     6.360198           NA
10    ACOT12     1.648498     3.762861     2.098422     1.143936     2.396126     2.049060           NA     2.690279     2.896370

Here we can see that some genes have NA in one or more columns, so my objective is to pull out these rows.

When I index the rows that contain NA (AACSP1, ABCA9, ABCA9-AS1, ABLIM1, ABLIM3, ABO, ACOT12), I get this:

          row col
AACSP1      3   1
ABCA9-AS1   5   1
ABLIM1      7   1
ABO         9   1
AACSP1      3   2
ABCA9-AS1   5   2
ABLIM1      7   2
AACSP1      3   3
ABCA9       4   3
ABLIM1      7   3
AACSP1      3   4
ABCA9-AS1   5   4
ABLIM1      7   4
ABO         9   4
AACSP1      3   5
ABCA9-AS1   5   5
ABLIM1      7   5
ABLIM3      8   5
ABO         9   5
AACSP1      3   6
ABCA9-AS1   5   6
ABLIM1      7   6
ABO         9   6
AACSP1      3   7
ABLIM1      7   7
ABO         9   7
ACOT12     10   7
AACSP1      3   8
ABCA9-AS1   5   8
ABLIM1      7   8
AACSP1      3   9
ABLIM1      7   9
ABO         9   9
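
The output above lists one line per NA cell, so a gene appears once for every column in which it is NA. A minimal sketch of collapsing that to one entry per gene, using a small made-up matrix `m` (hypothetical names and values, not the data above):

```r
# Hypothetical toy matrix with gene names as row names
m <- rbind(
  geneA = c(1, 2, 3),
  geneB = c(NA, 5, NA),
  geneC = c(7, 8, 9),
  geneD = c(10, NA, 12)
)
colnames(m) <- c("s1", "s2", "s3")

# One (row, col) pair per NA cell; row names are carried over from m
na_pos <- which(is.na(m), arr.ind = TRUE)

# Collapse to the distinct genes / row indices that contain at least one NA
na_genes <- unique(rownames(na_pos))
na_row_idx <- sort(unique(na_pos[, "row"]))

print(na_genes)
print(na_row_idx)
```

Either `na_genes` or `na_row_idx` can then be used to subset the original object.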

My idea is simply to keep these NA-containing rows (genes) in a separate data frame or object that I can use downstream in other analyses.

CodePudding user response:

If you are simply trying to split your original frame into rows with a low versus a high standard deviation, you can do this:

library(tibble)    # rownames_to_column()
library(magrittr)  # %>%

# `mat` is assumed to be a numeric matrix with gene symbols as row names
rld2 <- as.data.frame(mat) %>% rownames_to_column('gene')

# set the threshold that defines "high" deviation (I've picked a relatively low one here; you might choose something like 3)
sd_threshold <- 0.6

# get the row-specific standard deviation, using `apply()`
row_sds <- apply(rld2[, -1], 1, sd)

# split into a list of two frames: `FALSE` (SD at or below the threshold) and `TRUE` (SD above it)
low_high_split <- split(rld2, f = row_sds > sd_threshold)
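
For the goal stated in the question (keeping the NA-containing rows in their own data frame), a base R sketch along these lines may be enough. It uses a made-up `toy` frame (hypothetical names and values) and assumes, as in the question, that the first column holds the gene symbol and the rest are numeric.

```r
# Hypothetical toy data frame shaped like mat1: Symbol first, samples after
toy <- data.frame(
  Symbol = c("geneA", "geneB", "geneC", "geneD"),
  s1 = c(1, NA, 7, 10),
  s2 = c(2, 5, 8, NA),
  s3 = c(3, NA, 9, 12),
  stringsAsFactors = FALSE
)

# TRUE for every row with at least one NA among the sample columns
has_na <- rowSums(is.na(toy[, -1])) > 0

na_rows <- toy[has_na, ]      # genes with any NA, kept for downstream checks
clean_rows <- toy[!has_na, ]  # genes with complete data

print(na_rows)
print(clean_rows)
```

Applied to the question's data, the same two subsetting lines would use `mat1` in place of `toy`. `complete.cases(toy[, -1])` gives the negation of `has_na` if that reads more naturally.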