Using a temporal inner variable in dplyr outside of the group-CodePudding

I need to calculate an FDR variable per group, using an expected random distribution of p values (corresponds to the "Random" type).

library(dplyr)
library(data.table)
calculate_empirical_fdr = function(control_pVal, test_pVal) {
  m_control = length(control_pVal)
  m_test = length(test_pVal)
  unlist(lapply(test_pVal, function(significance_threshold) {
    m_control = length(control_pVal)
    m_test = length(test_pVal)

    FP_expected = length(control_pVal[control_pVal<=significance_threshold])*m_test/m_control # number of 
expected false positives in a p-value sequence with the size m_test
    S = length(test_pVal[test_pVal<=significance_threshold]) # number of significant hits (FP   TP)
    return(FP_expected/S)
  }))
}

An example dataset with groups I need to control for in the "Group" variable:

set.seed(42)
library(dplyr); library(data.table)
dataset_test = data.table(Type = c(rep("Random", 500),
                                   rep("test1", 500),
                                   rep("test2", 500)),
                          Group = sample(c("group1", "group2", "group3"), 1500, replace = T),
                          Pvalue = c(runif(n = 500),
                                     rbeta(n = 500, shape1 = 1, shape2 = 4),
                                     rbeta(n = 500, shape1 = 1, shape2 = 6))
                          )

Data visualization:

I have found that the best way to use my function per group would be using a temporal variable where I can store the p values of the random type, but this does not work:

dataset_test %>%
  group_by(Group) %>%
  {filter(Type=="Random") %>% select(Pvalue) ->> control_set } %>%
  group_by(Type, add = T) %>%
  mutate(FDR_empirical = calculate_empirical_fdr(control_pVal = control_set,
                              test_pVal = Pvalue)) %>%
  data.table()

Error in filter(Type == "Random") : object 'Type' not found

I understand that probably temporal vairables "do not see" the environment within the data.table, would be glad to hear any suggestions how to fix it.

CodePudding user response：

You can do something like this, which filters the control group P-values using the data.table special .BY

setDT(dataset_test)
dataset_test[
  i= Type!="Random",
  j = FDR_empirical:=calculate_empirical_fdr(dataset_test[Type=="Random" &  Group ==.BY$Group, Pvalue], Pvalue),
  by = .(Group, Type)
]

Output:

        Type  Group     Pvalue FDR_empirical
   1: Random group1 0.70292111            NA
   2: Random group1 0.72383117            NA
   3: Random group1 0.76413459            NA
   4: Random group1 0.87942702            NA
   5: Random group2 0.71229213            NA
  ---                                       
1496:  test2 group3 0.34817178     0.3681791
1497:  test2 group1 0.22419118     0.2308988
1498:  test2 group3 0.07258545     0.2314655
1499:  test2 group2 0.24687976     0.2849462
1500:  test2 group1 0.12206777     0.1760657

CodePudding user response：

Two possible solutions

Use the dot .

dataset_test %>%
  group_by(Group) %>%
  {filter(., Type=="Random") %>% select(Pvalue) ->> control_set; . } %>%
  group_by(Type, add = T)

Use the tee-pipe %T>% from the magrittr package

library(magrittr)

dataset_test %>%
  group_by(Group) %T>% {
      filter(., Type=="Random") %>% select(Pvalue) ->> control_set} %>% 
  group_by(Type, add = T)