Filter rows based on colSums for groups-CodePudding

In the data frame below I want to remove the rows in the case where Freq is higher than 0.5 in more than 50% of the IDs in the loc column. For instance, below, all rows containing G__Achromobacter should be removed since Freq is above 0.5 for more than 50% of loc.

I have tried tidyverse and group_by using loc and colSums, but not figuring out.

                  ID              loc absolute      Freq   variable     value
2     G__Abiotrophia            Brain        9 0.2294118 NotPresent 0.4705882
11    G__Abiotrophia      Gallbladder       13 0.1652174 NotPresent 0.4347826
12    G__Abiotrophia Gastroesophageal        7 0.1750000 NotPresent 0.1250000
31    G__Abiotrophia Urothelial tract       82 0.5503356 NotPresent 0.4496644
82  G__Achromobacter            Brain       11 0.1470588 NotPresent 0.3529412
93  G__Achromobacter    Head and neck       33 0.5409836 NotPresent 0.4590164
95  G__Achromobacter           Kidney       66 0.5365854 NotPresent 0.4634146
99  G__Achromobacter      Mesothelium       19 0.5135135 NotPresent 0.4864865
102 G__Achromobacter         Pancreas       63 0.5575221 NotPresent 0.4424779

dput

df <- structure(list(ID = c("G__Abiotrophia", "G__Abiotrophia", "G__Abiotrophia", 
"G__Abiotrophia", "G__Achromobacter", "G__Achromobacter", "G__Achromobacter", 
"G__Achromobacter", "G__Achromobacter"), loc = c("Brain", "Gallbladder", 
"Gastroesophageal", "Urothelial tract", "Brain", "Head and neck", 
"Kidney", "Mesothelium", "Pancreas"), absolute = c(9L, 13L, 7L, 
82L, 11L, 33L, 66L, 19L, 63L), Freq = c(0.229411764705882, 0.165217391304348, 
0.175, 0.550335570469799, 0.147058823529412, 0.540983606557377, 
0.536585365853659, 0.513513513513513, 0.557522123893805), variable = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("NotPresent", "Present"
), class = "factor"), value = c(0.470588235294118, 0.434782608695652, 
0.125, 0.449664429530201, 0.352941176470588, 0.459016393442623, 
0.463414634146341, 0.486486486486487, 0.442477876106195)), row.names = c(2L, 
11L, 12L, 31L, 82L, 93L, 95L, 99L, 102L), class = "data.frame")

CodePudding user response：

Count the number of Freq > 0.5, if the count is greater than 50% of unique loc, remove it.

library(tidyverse)

df %>% 
  group_by(ID) %>% 
  filter(ifelse(sum(Freq > 0.5) > length(unique(loc))/2, F, T))

# A tibble: 4 x 6
# Groups:   ID [1]
  ID             loc              absolute  Freq variable   value
  <chr>          <chr>               <int> <dbl> <fct>      <dbl>
1 G__Abiotrophia Brain                   9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder            13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal        7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract       82 0.550 NotPresent 0.450

CodePudding user response：

I found a way to solve your problem. It is probably not the easiest solution but it works.

 df %>% group_by(ID) %>% mutate(condition=ifelse(Freq>0.5,0,1)) %>%
   mutate(selected=sum(condition/length(ID))) %>%
   filter(selected>0.5) %>% 
   select(!c(condition,selected))
# A tibble: 4 × 6
# Groups:   ID [1]
  ID             loc              absolute  Freq variable   value
  <chr>          <chr>               <int> <dbl> <fct>      <dbl>
1 G__Abiotrophia Brain                   9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder            13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal        7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract       82 0.550 NotPresent 0.450

Let me explain you what is happening. First you group the variables by ID so each function is performed independantly on each group then you check for each row if their frequence is above 0.5 and you assign the value 0 (FALSE) if this condition is satisfied. Then you do the sum of the true rows divided by the total sum of the rows in the group, if this sum is above 0.5 then it satisfies your condition. Eventually you delete the rows I created and your dataframe has correctly been trimmed of the rows you wanted to get rid off.