In the data frame below I want to remove the rows in the case where Freq
is higher than 0.5 in more than 50% of the IDs in the loc
column. For instance, below, all rows containing G__Achromobacter
should be removed since Freq
is above 0.5 for more than 50% of loc
.
I have tried tidyverse
and group_by
using loc and colSums, but not figuring out.
ID loc absolute Freq variable value
2 G__Abiotrophia Brain 9 0.2294118 NotPresent 0.4705882
11 G__Abiotrophia Gallbladder 13 0.1652174 NotPresent 0.4347826
12 G__Abiotrophia Gastroesophageal 7 0.1750000 NotPresent 0.1250000
31 G__Abiotrophia Urothelial tract 82 0.5503356 NotPresent 0.4496644
82 G__Achromobacter Brain 11 0.1470588 NotPresent 0.3529412
93 G__Achromobacter Head and neck 33 0.5409836 NotPresent 0.4590164
95 G__Achromobacter Kidney 66 0.5365854 NotPresent 0.4634146
99 G__Achromobacter Mesothelium 19 0.5135135 NotPresent 0.4864865
102 G__Achromobacter Pancreas 63 0.5575221 NotPresent 0.4424779
dput
df <- structure(list(ID = c("G__Abiotrophia", "G__Abiotrophia", "G__Abiotrophia",
"G__Abiotrophia", "G__Achromobacter", "G__Achromobacter", "G__Achromobacter",
"G__Achromobacter", "G__Achromobacter"), loc = c("Brain", "Gallbladder",
"Gastroesophageal", "Urothelial tract", "Brain", "Head and neck",
"Kidney", "Mesothelium", "Pancreas"), absolute = c(9L, 13L, 7L,
82L, 11L, 33L, 66L, 19L, 63L), Freq = c(0.229411764705882, 0.165217391304348,
0.175, 0.550335570469799, 0.147058823529412, 0.540983606557377,
0.536585365853659, 0.513513513513513, 0.557522123893805), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("NotPresent", "Present"
), class = "factor"), value = c(0.470588235294118, 0.434782608695652,
0.125, 0.449664429530201, 0.352941176470588, 0.459016393442623,
0.463414634146341, 0.486486486486487, 0.442477876106195)), row.names = c(2L,
11L, 12L, 31L, 82L, 93L, 95L, 99L, 102L), class = "data.frame")
CodePudding user response:
Count the number of Freq > 0.5
, if the count is greater than 50% of unique loc
, remove it.
library(tidyverse)
df %>%
group_by(ID) %>%
filter(ifelse(sum(Freq > 0.5) > length(unique(loc))/2, F, T))
# A tibble: 4 x 6
# Groups: ID [1]
ID loc absolute Freq variable value
<chr> <chr> <int> <dbl> <fct> <dbl>
1 G__Abiotrophia Brain 9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder 13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal 7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract 82 0.550 NotPresent 0.450
CodePudding user response:
I found a way to solve your problem. It is probably not the easiest solution but it works.
df %>% group_by(ID) %>% mutate(condition=ifelse(Freq>0.5,0,1)) %>%
mutate(selected=sum(condition/length(ID))) %>%
filter(selected>0.5) %>%
select(!c(condition,selected))
# A tibble: 4 × 6
# Groups: ID [1]
ID loc absolute Freq variable value
<chr> <chr> <int> <dbl> <fct> <dbl>
1 G__Abiotrophia Brain 9 0.229 NotPresent 0.471
2 G__Abiotrophia Gallbladder 13 0.165 NotPresent 0.435
3 G__Abiotrophia Gastroesophageal 7 0.175 NotPresent 0.125
4 G__Abiotrophia Urothelial tract 82 0.550 NotPresent 0.450
Let me explain you what is happening. First you group the variables by ID so each function is performed independantly on each group then you check for each row if their frequence is above 0.5 and you assign the value 0 (FALSE) if this condition is satisfied. Then you do the sum of the true rows divided by the total sum of the rows in the group, if this sum is above 0.5 then it satisfies your condition. Eventually you delete the rows I created and your dataframe has correctly been trimmed of the rows you wanted to get rid off.