Filter data frame to remove some repeated factors with different sign

Time:08-24

I have a large data frame with columns such as name, position, expression level, q value, and so on. Most objects appear several times under the same name but with different expression levels. I want to drop a name entirely when its expression levels have opposite signs, i.e. a mix of up-regulated (+) and down-regulated (-) values. If the repeats have different expression levels but are all up-regulated (+) or all down-regulated (-), I want to keep them. Here is an example of my data:

df1<-data.frame(gene.name=c( "DEC1","DEC1","DEC1","ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
                expression.level=c(2.01,0.5,-1.56,3.1,0.67,0.1,1.2,3),
                q.value=c(0.001,0.002,0.0001,0.9,0.00001,0.9,0.0002,0.002))

The desired output looks like this:

output<-data.frame(gene.name=c( "ATP","ANXA2","ANXA1","ANXA1","ANXA1"),
                   expression.level=c(3.1,0.67,0.1,1.2,3),
                   q.value=c(0.9,0.00001,0.9,0.0002,0.002))

Thanks in advance for your help.

CodePudding user response:

We can use sign() to check whether each value is positive, negative, or zero. Then, after grouping by gene.name, use filter() to keep only the groups in which every value has the same sign.

library(dplyr)

df1 %>% 
  group_by(gene.name) %>% 
  filter(length(unique(sign(expression.level))) == 1) %>% 
  ungroup()

  gene.name expression.level q.value
1       ATP             3.10   9e-01
2     ANXA2             0.67   1e-05
3     ANXA1             0.10   9e-01
4     ANXA1             1.20   2e-04
5     ANXA1             3.00   2e-03
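
To see why the grouped filter drops DEC1, it can help to look at the intermediate values. The following is a minimal base R sketch, assuming the df1 from the question; the name n_signs is only illustrative and does not appear in the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  gene.name = c("DEC1", "DEC1", "DEC1", "ATP", "ANXA2", "ANXA1", "ANXA1", "ANXA1"),
  expression.level = c(2.01, 0.5, -1.56, 3.1, 0.67, 0.1, 1.2, 3),
  q.value = c(0.001, 0.002, 0.0001, 0.9, 0.00001, 0.9, 0.0002, 0.002)
)

# sign() maps every value to -1, 0, or 1
signs <- sign(df1$expression.level)
print(signs)

# Count the distinct signs per gene (n_signs is a hypothetical helper name).
# A count of 1 means the gene is consistently up- or down-regulated.
n_signs <- tapply(df1$expression.level, df1$gene.name,
                  function(x) length(unique(sign(x))))
print(n_signs)
```

Note that sign(0) is 0, so a gene with an expression level of exactly zero next to non-zero values would count as having two signs and would be removed as well.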

CodePudding user response:

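Using ave() you can do this with a base R one-liner. Note that the \(x) lambda shorthand requires R 4.1 or later; on older versions, write function(x) instead.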

df1[with(df1, ave(expression.level, gene.name, FUN=\(x) length(unique(sign(x))))) == 1, ]
#   gene.name expression.level q.value
# 4       ATP             3.10   9e-01
# 5     ANXA2             0.67   1e-05
# 6     ANXA1             0.10   9e-01
# 7     ANXA1             1.20   2e-04
# 8     ANXA1             3.00   2e-03
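
The one-liner works because ave() returns a vector as long as its input, with each group's result repeated for every row of that group. Below is a step-by-step sketch of the same idea, assuming the df1 from the question; the names n_signs, keep, and result are illustrative only, and function(x) is used so that it also runs on R versions before 4.1.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  gene.name = c("DEC1", "DEC1", "DEC1", "ATP", "ANXA2", "ANXA1", "ANXA1", "ANXA1"),
  expression.level = c(2.01, 0.5, -1.56, 3.1, 0.67, 0.1, 1.2, 3),
  q.value = c(0.001, 0.002, 0.0001, 0.9, 0.00001, 0.9, 0.0002, 0.002)
)

# For every row, the number of distinct signs within that row's gene
n_signs <- ave(df1$expression.level, df1$gene.name,
               FUN = function(x) length(unique(sign(x))))

# Logical mask: TRUE for rows whose gene has a single sign
keep <- n_signs == 1

# Subset the rows; the original row names (4 to 8) are preserved
result <- df1[keep, ]
print(result)
```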

CodePudding user response:

Using data.table

library(data.table)
setDT(df1)[df1[, .I[uniqueN(sign(expression.level)) == 1], gene.name]$V1]

-output

  gene.name expression.level q.value
      <char>            <num>   <num>
1:       ATP             3.10   9e-01
2:     ANXA2             0.67   1e-05
3:     ANXA1             0.10   9e-01
4:     ANXA1             1.20   2e-04
5:     ANXA1             3.00   2e-03