Home > Net >  Using a temporal inner variable in dplyr outside of the group
Using a temporal inner variable in dplyr outside of the group

Time:09-02

I need to calculate an FDR variable per group, using an expected random distribution of p values (corresponds to the "Random" type).

library(dplyr)
library(data.table)
calculate_empirical_fdr = function(control_pVal, test_pVal) {
  m_control = length(control_pVal)
  m_test = length(test_pVal)
  unlist(lapply(test_pVal, function(significance_threshold) {
    m_control = length(control_pVal)
    m_test = length(test_pVal)

    FP_expected = length(control_pVal[control_pVal<=significance_threshold])*m_test/m_control # number of 
expected false positives in a p-value sequence with the size m_test
    S = length(test_pVal[test_pVal<=significance_threshold]) # number of significant hits (FP   TP)
    return(FP_expected/S)
  }))
}

An example dataset with groups I need to control for in the "Group" variable:

set.seed(42)
library(dplyr); library(data.table)
dataset_test = data.table(Type = c(rep("Random", 500),
                                   rep("test1", 500),
                                   rep("test2", 500)),
                          Group = sample(c("group1", "group2", "group3"), 1500, replace = T),
                          Pvalue = c(runif(n = 500),
                                     rbeta(n = 500, shape1 = 1, shape2 = 4),
                                     rbeta(n = 500, shape1 = 1, shape2 = 6))
                          )

Data visualization: enter image description here

I have found that the best way to use my function per group would be using a temporal variable where I can store the p values of the random type, but this does not work:

dataset_test %>%
  group_by(Group) %>%
  {filter(Type=="Random") %>% select(Pvalue) ->> control_set } %>%
  group_by(Type, add = T) %>%
  mutate(FDR_empirical = calculate_empirical_fdr(control_pVal = control_set,
                              test_pVal = Pvalue)) %>%
  data.table()

Error in filter(Type == "Random") : object 'Type' not found

I understand that probably temporal vairables "do not see" the environment within the data.table, would be glad to hear any suggestions how to fix it.

CodePudding user response:

You can do something like this, which filters the control group P-values using the data.table special .BY

setDT(dataset_test)
dataset_test[
  i= Type!="Random",
  j = FDR_empirical:=calculate_empirical_fdr(dataset_test[Type=="Random" &  Group ==.BY$Group, Pvalue], Pvalue),
  by = .(Group, Type)
]

Output:

        Type  Group     Pvalue FDR_empirical
   1: Random group1 0.70292111            NA
   2: Random group1 0.72383117            NA
   3: Random group1 0.76413459            NA
   4: Random group1 0.87942702            NA
   5: Random group2 0.71229213            NA
  ---                                       
1496:  test2 group3 0.34817178     0.3681791
1497:  test2 group1 0.22419118     0.2308988
1498:  test2 group3 0.07258545     0.2314655
1499:  test2 group2 0.24687976     0.2849462
1500:  test2 group1 0.12206777     0.1760657

CodePudding user response:

Two possible solutions

Use the dot .

dataset_test %>%
  group_by(Group) %>%
  {filter(., Type=="Random") %>% select(Pvalue) ->> control_set; . } %>%
  group_by(Type, add = T)

Use the tee-pipe %T>% from the magrittr package

library(magrittr)

dataset_test %>%
  group_by(Group) %T>% {
      filter(., Type=="Random") %>% select(Pvalue) ->> control_set} %>% 
  group_by(Type, add = T) 
  • Related