I use a data.table code that works fine but that I fail to transform to include an ifelse
statement. I use the following reprex:
set.seed(1645)
Place <- c(rep("Copenhagen",7),rep("Berlin",11),rep("Roma",12))
Year <- c(rep("2020",4),rep("2021",3),rep("2020",6),rep("2021",5),rep("2019",4),rep("2020",4),rep("2021",4))
Value1 <- c(runif(3),NA,runif(8),NA,runif(9),NA,runif(7))
Value2 <- c(runif(4),NA,runif(2),runif(6),NA,NA,runif(11),NA,NA,runif(2))
df <- data.frame(Place,Year,Value1,Value2)
> df
Place Year Value1 Value2
1 Copenhagen 2020 0.10517697 0.865935100
2 Copenhagen 2020 0.96597760 0.579956282
3 Copenhagen 2020 0.47262307 0.346569960
4 Copenhagen 2020 NA 0.478763951
5 Copenhagen 2021 0.90030423 NA
6 Copenhagen 2021 0.14444142 0.280377315
7 Copenhagen 2021 0.73801550 0.302816525
8 Berlin 2020 0.13961383 0.641314310
9 Berlin 2020 0.40221211 0.756374251
10 Berlin 2020 0.49613139 0.070459347
11 Berlin 2020 0.95190545 0.184497038
12 Berlin 2020 0.40182901 0.407892240
13 Berlin 2020 NA 0.002209376
14 Berlin 2021 0.38310025 NA
15 Berlin 2021 0.76417492 NA
16 Berlin 2021 0.29001287 0.632133629
17 Berlin 2021 0.84478784 0.365406326
18 Berlin 2021 0.55547323 0.493870653
19 Roma 2019 0.44198733 0.067744090
20 Roma 2019 0.50403809 0.847876518
21 Roma 2019 0.85358805 0.952393606
22 Roma 2019 0.74996137 0.887583928
23 Roma 2020 NA 0.631937527
24 Roma 2020 0.08303509 0.993400333
25 Roma 2020 0.74205719 0.589183185
26 Roma 2020 0.27552659 0.522451407
27 Roma 2021 0.39518410 NA
28 Roma 2021 0.38390124 NA
29 Roma 2021 0.36605674 0.942102065
30 Roma 2021 0.32014949 0.375689863
I would like to calculate the mean by Place and Year if there are <= 25% NA in the group. Without my condition, this works fine:
setDT(df)
df_means <- df[,.(Value1_mean = mean(Value1),Value2_mean = mean(Value2)), by = .(Place,Year)]
> df_means
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 NA 0.4257258
2: Copenhagen 2021 0.3581245 NA
3: Berlin 2020 NA 0.3935807
4: Berlin 2021 0.3729461 NA
5: Roma 2019 0.4572996 0.3956536
6: Roma 2020 NA 0.6494491
7: Roma 2021 0.4142637 NA
I fail to include the ifelse
statement, this does not work:
df_means2 <- df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>=0.25,NA,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>=0.25,NA,mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
I checked these posts 1, 2, and 3 without solving my issue. My expected result should be this:
> df_means2
Place Year Value1_mean Value2_mean
1 Copenhagen 2020 mean mean
2 Copenhagen 2021 mean <NA>
3 Berlin 2020 mean mean
4 Berlin 2021 mean <NA>
5 Roma 2019 mean mean
6 Roma 2020 mean mean
7 Roma 2021 mean <NA>
How can I transform my code?
CodePudding user response:
We may use if/else
library(data.table)
df[, lapply(.SD, function(x) if(mean(is.na(x)) <= 0.25)
mean(x, na.rm = TRUE) else NA_real_), by = .(Place, Year)]
-output
Place Year Value1 Value2
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA
In the OP's code, use NA_real_
instead of NA
which by default is logical
this creates a clash in the class
df[,.(Value1_mean = ifelse(sum(is.na(Value1))/length(Value1)>0.25,
NA_real_,mean(Value1,na.rm=TRUE)),
Value2_mean = ifelse(sum(is.na(Value2))/length(Value2)>0.25,
NA_real_,
mean(Value2,na.rm=TRUE))),
by = .(Place,Year)]
-output
Place Year Value1_mean Value2_mean
1: Copenhagen 2020 0.5145925 0.5678063
2: Copenhagen 2021 0.5942537 NA
3: Berlin 2020 0.4783384 0.3437911
4: Berlin 2021 0.5675098 NA
5: Roma 2019 0.6373937 0.6888995
6: Roma 2020 0.3668730 0.6842431
7: Roma 2021 0.3663229 NA