Home > front end >  How to use an ifelse statement to take means by groups with data.table syntax?
How to use an ifelse statement to take means by groups with data.table syntax?

Time:12-01

I use a data.table code that works fine but that I fail to transform to include an ifelse statement. I use the following reprex:

set.seed(1645)
Place <- c(rep("Copenhagen",7),rep("Berlin",11),rep("Roma",12))
Year <- c(rep("2020",4),rep("2021",3),rep("2020",6),rep("2021",5),rep("2019",4),rep("2020",4),rep("2021",4))
Value1 <- c(runif(3),NA,runif(8),NA,runif(9),NA,runif(7))
Value2 <- c(runif(4),NA,runif(2),runif(6),NA,NA,runif(11),NA,NA,runif(2))
df <- data.frame(Place,Year,Value1,Value2)

> df
        Place Year     Value1      Value2
1  Copenhagen 2020 0.10517697 0.865935100
2  Copenhagen 2020 0.96597760 0.579956282
3  Copenhagen 2020 0.47262307 0.346569960
4  Copenhagen 2020         NA 0.478763951
5  Copenhagen 2021 0.90030423          NA
6  Copenhagen 2021 0.14444142 0.280377315
7  Copenhagen 2021 0.73801550 0.302816525
8      Berlin 2020 0.13961383 0.641314310
9      Berlin 2020 0.40221211 0.756374251
10     Berlin 2020 0.49613139 0.070459347
11     Berlin 2020 0.95190545 0.184497038
12     Berlin 2020 0.40182901 0.407892240
13     Berlin 2020         NA 0.002209376
14     Berlin 2021 0.38310025          NA
15     Berlin 2021 0.76417492          NA
16     Berlin 2021 0.29001287 0.632133629
17     Berlin 2021 0.84478784 0.365406326
18     Berlin 2021 0.55547323 0.493870653
19       Roma 2019 0.44198733 0.067744090
20       Roma 2019 0.50403809 0.847876518
21       Roma 2019 0.85358805 0.952393606
22       Roma 2019 0.74996137 0.887583928
23       Roma 2020         NA 0.631937527
24       Roma 2020 0.08303509 0.993400333
25       Roma 2020 0.74205719 0.589183185
26       Roma 2020 0.27552659 0.522451407
27       Roma 2021 0.39518410          NA
28       Roma 2021 0.38390124          NA
29       Roma 2021 0.36605674 0.942102065
30       Roma 2021 0.32014949 0.375689863

I would like to calculate the mean by Place and Year if there are <= 25% NA in the group. Without my condition, this works fine:

setDT(df)
df_means <- df[,.(Value1_mean = mean(Value1),Value2_mean = mean(Value2)), by = .(Place,Year)]

> df_means
        Place Year Value1_mean Value2_mean
1: Copenhagen 2020          NA   0.4257258
2: Copenhagen 2021   0.3581245          NA
3:     Berlin 2020          NA   0.3935807
4:     Berlin 2021   0.3729461          NA
5:       Roma 2019   0.4572996   0.3956536
6:       Roma 2020          NA   0.6494491
7:       Roma 2021   0.4142637          NA

I fail to include the ifelse statement, this does not work:

df_means2 <- df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>=0.25,NA,mean(Value1,na.rm=TRUE)),
                   Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>=0.25,NA,mean(Value2,na.rm=TRUE))), 
                by = .(Place,Year)]

I checked these posts 1, 2, and 3 without solving my issue. My expected result should be this:

> df_means2
       Place Year Value1_mean Value2_mean
1 Copenhagen 2020        mean        mean
2 Copenhagen 2021        mean        <NA>
3     Berlin 2020        mean        mean
4     Berlin 2021        mean        <NA>
5       Roma 2019        mean        mean
6       Roma 2020        mean        mean
7       Roma 2021        mean        <NA>

How can I transform my code?

CodePudding user response:

We may use if/else

library(data.table)
df[, lapply(.SD, function(x) if(mean(is.na(x)) <= 0.25) 
     mean(x, na.rm = TRUE) else NA_real_), by = .(Place, Year)]

-output

     Place Year    Value1    Value2
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537        NA
3:     Berlin 2020 0.4783384 0.3437911
4:     Berlin 2021 0.5675098        NA
5:       Roma 2019 0.6373937 0.6888995
6:       Roma 2020 0.3668730 0.6842431
7:       Roma 2021 0.3663229        NA

In the OP's code, use NA_real_ instead of NA which by default is logical this creates a clash in the class

df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>0.25,
    NA_real_,mean(Value1,na.rm=TRUE)),
                   Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>0.25,
      NA_real_,
      mean(Value2,na.rm=TRUE))), 
                by = .(Place,Year)]

-output

        Place Year Value1_mean Value2_mean
1: Copenhagen 2020   0.5145925   0.5678063
2: Copenhagen 2021   0.5942537          NA
3:     Berlin 2020   0.4783384   0.3437911
4:     Berlin 2021   0.5675098          NA
5:       Roma 2019   0.6373937   0.6888995
6:       Roma 2020   0.3668730   0.6842431
7:       Roma 2021   0.3663229          NA
  • Related