I want to compare the effect of different fertilizer doses on multiple crop cultivars at various locations. My dataset is similar to the one generated below:
locs <- rep(c("loc1","loc2","loc3"), length.out=180)
cults <- rep(c("cult1","cult2","cult3","cult4","cult5"), length.out=180)
doses <- rep(c("no_fert","40kg","50kg","60kg"), length.out=180)
set.seed(123); yld <- runif(3*length(unique(locs))*length(unique(cults))*length(unique(doses)), min=3, max=15)
dat <- data.frame(location=locs,
cultivar=cults,
fert_dose=doses,
yield=yld)
Note there are three repetitions of each fertilizer dosage (but there are more in my actual dataset).
The first thing I need to do is to calculate the average for the three repetitions of each location-cultivar-fertilizer combination.
I can do it - in a probably not so efficient way - like this:
d1 <- d2 <- d3 <- list()
for (i in 1:length(unique(locs))){
for (j in 1:length(unique(cults))){
for (k in 1:length(unique(doses))){
d1[[k]] <- data.frame(location=locs[i],
cultivar=cults[j],
fert_dose=doses[k],
mean_yield=mean(dat[dat$location==locs[i]&dat$cultivar==cults[j]&dat$fert_dose==doses[k],]$yield))
}
d2[[j]] <- do.call(rbind,d1)
}
d3[[i]] <- do.call(rbind,d2)
}
(mean_dat <- do.call(rbind, d3))
Next, what I need to do is: for each location, find the yield difference among all combinations of cultivar and fertilizer doses.
For example, considering only loc1
and cult1
, the expected result would be:
res <- "
location cultivar dose dose_mean other_cultivar other_dose other_mean diff
loc1 cult1 no_fert 9.402345 cult1 40kg 9.251377 0.150968
loc1 cult1 no_fert 9.402345 cult1 50kg 10.764692 -1.362347
loc1 cult1 no_fert 9.402345 cult1 60kg 10.119129 -0.716784
loc1 cult1 40kg 9.251377 cult1 no_fert 9.402345 -0.150968
loc1 cult1 40kg 9.251377 cult1 50kg 10.764692 -1.513315
loc1 cult1 40kg 9.251377 cult1 60kg 10.119129 -0.867752
loc1 cult1 50kg 10.764692 cult1 no_fert 9.402345 1.362347
loc1 cult1 50kg 10.764692 cult1 40kg 9.251377 1.513315
loc1 cult1 50kg 10.764692 cult1 60kg 10.119129 0.645563
loc1 cult1 60kg 10.119129 cult1 no_fert 9.402345 0.716784
loc1 cult1 60kg 10.119129 cult1 40kg 9.251377 0.867752
loc1 cult1 60kg 10.119129 cult1 50kg 10.764692 -0.645563
"
(res <- read.table(textConnection(res), sep=" ", header=T, stringsAsFactors=F))
In this table, I am repeating the yield values for each dose obtained in the previous step (mean_dat
table) and calculating a simple the difference between them. The resulting table would continue this analysis, including the other cultivars in the other_cultivar
column.
I reckon the expected table does not look very good, but it will be used to feed an interactive dashboard, and this is the format it requires, so I don't think I have much choice here.
Is there any programmatic way to achieve these two results in just one step?
CodePudding user response:
dplyr
library(dplyr)
dat %>%
group_by(location, cultivar, dose = fert_dose) %>%
summarize(dose_mean = mean(yield), .groups = "drop") %>%
full_join(., ., by = "location", suffix = c("", "_other")) %>%
filter(cultivar != cultivar_other | dose != dose_other) %>%
mutate(diff = dose_mean - dose_mean_other)
# # A tibble: 1,140 x 8
# location cultivar dose dose_mean cultivar_other dose_other dose_mean_other diff
# <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
# 1 loc1 cult1 40kg 9.25 cult1 50kg 10.8 -1.51
# 2 loc1 cult1 40kg 9.25 cult1 60kg 10.1 -0.868
# 3 loc1 cult1 40kg 9.25 cult1 no_fert 9.40 -0.151
# 4 loc1 cult1 40kg 9.25 cult2 40kg 10.1 -0.830
# 5 loc1 cult1 40kg 9.25 cult2 50kg 8.97 0.282
# 6 loc1 cult1 40kg 9.25 cult2 60kg 6.71 2.54
# 7 loc1 cult1 40kg 9.25 cult2 no_fert 11.5 -2.20
# 8 loc1 cult1 40kg 9.25 cult3 40kg 11.9 -2.70
# 9 loc1 cult1 40kg 9.25 cult3 50kg 9.21 0.0421
# 10 loc1 cult1 40kg 9.25 cult3 60kg 9.26 -0.00416
# # ... with 1,130 more rows
Note that this is doing an outer join on cultivar
and dose
. We started with 180 rows and ended with 1140, this will grow geometrically.
data.table
library(data.table)
DT <- as.data.table(dat)[, .(dose_mean = mean(yield)), by = .(location, cultivar, dose = fert_dose)]
merge(DT, DT, by = "location", all = TRUE, suffix = c("", "_other"), allow.cartesian = TRUE
)[(cultivar != cultivar_other | dose != dose_other),
][, diff := dose_mean - dose_mean_other][]
# location cultivar dose dose_mean cultivar_other dose_other dose_mean_other diff
# <char> <char> <char> <num> <char> <char> <num> <num>
# 1: loc1 cult1 no_fert 9.402345 cult4 60kg 8.508675 0.893670057
# 2: loc1 cult1 no_fert 9.402345 cult2 50kg 8.969489 0.432856209
# 3: loc1 cult1 no_fert 9.402345 cult5 40kg 9.345814 0.056530679
# 4: loc1 cult1 no_fert 9.402345 cult3 no_fert 11.243009 -1.840663741
# 5: loc1 cult1 no_fert 9.402345 cult1 60kg 10.119129 -0.716784445
# 6: loc1 cult1 no_fert 9.402345 cult4 50kg 9.638162 -0.235817407
# 7: loc1 cult1 no_fert 9.402345 cult2 40kg 10.081336 -0.678991009
# 8: loc1 cult1 no_fert 9.402345 cult5 no_fert 9.405199 -0.002854273
# 9: loc1 cult1 no_fert 9.402345 cult3 60kg 9.255537 0.146807576
# 10: loc1 cult1 no_fert 9.402345 cult1 50kg 10.764692 -1.362347580
# ---
# 1131: loc3 cult5 60kg 8.442893 cult5 40kg 7.217206 1.225686617
# 1132: loc3 cult5 60kg 8.442893 cult3 no_fert 8.688523 -0.245630492
# 1133: loc3 cult5 60kg 8.442893 cult1 60kg 7.221926 1.220966527
# 1134: loc3 cult5 60kg 8.442893 cult4 50kg 7.918912 0.523980425
# 1135: loc3 cult5 60kg 8.442893 cult2 40kg 7.405098 1.037794838
# 1136: loc3 cult5 60kg 8.442893 cult5 no_fert 6.963170 1.479722527
# 1137: loc3 cult5 60kg 8.442893 cult3 60kg 8.183201 0.259691148
# 1138: loc3 cult5 60kg 8.442893 cult1 50kg 9.444416 -1.001523464
# 1139: loc3 cult5 60kg 8.442893 cult4 40kg 10.264777 -1.821884187
# 1140: loc3 cult5 60kg 8.442893 cult2 no_fert 7.196217 1.246675164
Note that doing this is data.table
works well but does not really reduce the memory footprint of in-place calculations or speed usually attributed to data.table
-based solutions.
CodePudding user response:
With data.table
, you can do the following join:
library(data.table)
locs <- rep(c("loc1","loc2","loc3"), length.out=180)
cults <- rep(c("cult1","cult2","cult3","cult4","cult5"), length.out=180)
doses <- rep(c("no_fert","40kg","50kg","60kg"), length.out=180)
set.seed(123); yld <- runif(3*length(unique(locs))*length(unique(cults))*length(unique(doses)), min=3, max=15)
dat <- data.frame(location=locs,
cultivar=cults,
fert_dose=doses,
yield=yld)
setDT(dat)
dat[dat, .(cultivar_1 = cultivar,
cultivar_2 = i.cultivar,
fert_dose_1 = fert_dose,
fert_dose_2 = i.fert_dose,
yield_1 = yield,
yield_2 = i.yield,
diff = yield - i.yield), on = "location", by = .EACHI][
!(cultivar_1 == cultivar_2 & fert_dose_1 == fert_dose_2)][
order(location, cultivar_1,fert_dose_1, cultivar_2, fert_dose_2)]
#> location cultivar_1 cultivar_2 fert_dose_1 fert_dose_2 yield_1
#> 1: loc1 cult1 cult1 40kg 50kg 4.665673
#> 2: loc1 cult1 cult1 40kg 50kg 13.684203
#> 3: loc1 cult1 cult1 40kg 50kg 9.404255
#> 4: loc1 cult1 cult1 40kg 50kg 4.665673
#> 5: loc1 cult1 cult1 40kg 50kg 13.684203
#> ---
#> 10256: loc3 cult5 cult5 no_fert 60kg 8.794829
#> 10257: loc3 cult5 cult5 no_fert 60kg 7.265345
#> 10258: loc3 cult5 cult5 no_fert 60kg 4.829337
#> 10259: loc3 cult5 cult5 no_fert 60kg 8.794829
#> 10260: loc3 cult5 cult5 no_fert 60kg 7.265345
#> yield_2 diff
#> 1: 14.556291 -9.89061803
#> 2: 14.556291 -0.87208812
#> 3: 14.556291 -5.15203544
#> 4: 4.568348 0.09732446
#> 5: 4.568348 9.11585437
#> ---
#> 10256: 7.854123 0.94070539
#> 10257: 7.854123 -0.58877881
#> 10258: 9.981001 -5.15166422
#> 10259: 9.981001 -1.18617243
#> 10260: 9.981001 -2.71565663
Created on 2022-10-26 with reprex v2.0.2
CodePudding user response:
A tidy dplyr solution with adding all values for each location to a new column, then filtering to remove the few identical combinations:
library(tidyverse)
myfunc <- function(df) {
df %>%
add_column(other = list(.)) %>%
unnest(other, names_sep = "_") %>%
filter(!(cultivar == other_cultivar & fert_dose == other_fert_dose)) %>%
mutate(diff = yield - other_yield)
}
datmeans <- dat %>%
group_by(location, cultivar, fert_dose) %>%
summarise(yield = mean(yield), .groups = "drop") %>%
group_split(location) %>%
map(myfunc) %>%
bind_rows()
CodePudding user response:
A data.table
solution that avoids a larger join and subsequent filtering. It will be fast and memory-efficient.
dat_mean <- setDT(dat)[,.(mean_yield = mean(yield)), location:fert_dose][, doseIdx := match(fert_dose, unique(fert_dose))]
rbindlist(
lapply(
parse(
text = c(
".(location, doseIdx > doseIdx)",
".(location, doseIdx < doseIdx)"
)
),
function(e) {
dat_mean[
dat_mean,
.(
location,
cultivar1 = cultivar,
fert_dose1 = fert_dose,
yield1 = mean_yield,
cultivar2 = i.cultivar,
fert_dose2 = i.fert_dose,
yield2 = i.mean_yield,
diff = mean_yield - i.mean_yield
),
on = eval(e)
]
}
)
)
#> location cultivar1 fert_dose1 yield1 cultivar2 fert_dose2 yield2 diff
#> 1: loc1 cult4 60kg 8.508675 cult1 no_fert 9.402345 -0.89367006
#> 2: loc1 cult2 50kg 8.969489 cult1 no_fert 9.402345 -0.43285621
#> 3: loc1 cult5 40kg 9.345814 cult1 no_fert 9.402345 -0.05653068
#> 4: loc1 cult1 60kg 10.119129 cult1 no_fert 9.402345 0.71678444
#> 5: loc1 cult4 50kg 9.638162 cult1 no_fert 9.402345 0.23581741
#> ---
#> 926: loc3 cult2 40kg 7.405098 cult5 60kg 8.442893 -1.03779484
#> 927: loc3 cult5 no_fert 6.963170 cult5 60kg 8.442893 -1.47972253
#> 928: loc3 cult1 50kg 9.444416 cult5 60kg 8.442893 1.00152346
#> 929: loc3 cult4 40kg 10.264777 cult5 60kg 8.442893 1.82188419
#> 930: loc3 cult2 no_fert 7.196217 cult5 60kg 8.442893 -1.24667516