I need help calculating the RMSE for two groups in a dataset that looks like this:
structure(list(machine = c("B", "B", "B", "B", "B", "B", "B",
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"),
measured = c(14.47, 15.33, 18.56, 14.89, 17.24, 16.25, 13,
20.52, 18.06, 13.09, 16.88, 15.92, 14.47, 18.63, 13.88, 16.32,
13.83, 11.67, 13.42), predicted = c(15.83, 16, 17.87, 14.21,
17.77, 14.14, 12.01, 19.31, 16.98, 13.19, 15.6, 17.16, 16.07,
17.38, 17.99, 17.86, 18.54, 10.79, 16.06)), class = "data.frame", row.names = c(NA,
-19L))
I want to calculate the RMSE for each machine and, if possible, add it to my scatterplot.
I attempted this:
library(ggplot2)
library(ggpmisc)  # provides stat_poly_eq()
fr <- read.csv(file.choose())
# theme_set() changes the global default theme, so call it before building the plot
theme_set(theme_bw(base_size = 16))
ggplot(fr, aes(measured, predicted, colour = machine)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  stat_poly_eq(aes(label = paste(after_stat(eq.label),
                                 after_stat(rr.label), sep = "*\", \"*"))) +
  theme(axis.line = element_line(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())
I couldn’t find a way to automatically calculate the RMSE for my model.
CodePudding user response:
In the future, please heed the comment and post your sample data as text using dput, not a screenshot.
This is a classic "split-apply-combine" problem:
- Split the data by machine type
- Apply the function you want to use, in this case RMSE
- Combine the results.
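As a minimal sketch of those three steps in plain base R, assuming the data from the question is in a data frame called fr (the names groups and rmse_by_machine are only illustrative):

```r
# Sample data copied from the question's dput() output
fr <- data.frame(
  machine = rep(c("B", "A"), times = c(9, 10)),
  measured = c(14.47, 15.33, 18.56, 14.89, 17.24, 16.25, 13,
               20.52, 18.06, 13.09, 16.88, 15.92, 14.47, 18.63,
               13.88, 16.32, 13.83, 11.67, 13.42),
  predicted = c(15.83, 16, 17.87, 14.21, 17.77, 14.14, 12.01,
                19.31, 16.98, 13.19, 15.6, 17.16, 16.07, 17.38,
                17.99, 17.86, 18.54, 10.79, 16.06)
)

# Split: one data frame per machine
groups <- split(fr, fr$machine)

# Apply: compute the RMSE within each group
# Combine: sapply() collects the results into a named numeric vector
rmse_by_machine <- sapply(groups, function(g) {
  sqrt(mean((g$measured - g$predicted)^2))
})

print(rmse_by_machine)
```

The result is a numeric vector with one element per machine, named after the machine.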
There are many, many ways to do this, using various libraries such as dplyr. If you search for "split apply combine" and the dplyr tag on this site, you will find many examples.
Here is one solution using base R, using a somewhat-fancy technique from the discipline of "functional programming":
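# Root mean squared error between measured and predicted values
rmse <- function(meas, pred) {
  sqrt(mean((meas - pred)^2))
}
# Wrap func so that it takes a data frame and receives its columns as arguments
apply_columns <- function(func, use_names = FALSE) {
  function(data) {
    if (!use_names) {
      # Drop column names so the columns are passed positionally
      data <- unname(data)
    }
    do.call(func, data)
  }
}
# Pass only the two numeric columns; the grouping column is lower-case "machine"
machine_rmse <- by(fr[c("measured", "predicted")], fr$machine, apply_columns(rmse), simplify = FALSE)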
The resulting machine_rmse object is a list, which you can convert to a data frame and then use in your ggplot2 code:
machine_rmse <- data.frame(
name = names(machine_rmse),
val = unlist(as.list(machine_rmse)))
However, it's unclear from your question exactly how you want to display this information in your plot, so I can't advise further on how to incorporate it into your ggplot code.
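That said, if plain text labels inside the plot panel would do, a rough sketch could look like the following. It assumes ggplot2 is installed; the rmse_labels data frame, the label wording, and the label positions are all illustrative choices, not something taken from the question.

```r
library(ggplot2)

# Sample data copied from the question's dput() output
fr <- data.frame(
  machine = rep(c("B", "A"), times = c(9, 10)),
  measured = c(14.47, 15.33, 18.56, 14.89, 17.24, 16.25, 13,
               20.52, 18.06, 13.09, 16.88, 15.92, 14.47, 18.63,
               13.88, 16.32, 13.83, 11.67, 13.42),
  predicted = c(15.83, 16, 17.87, 14.21, 17.77, 14.14, 12.01,
                19.31, 16.98, 13.19, 15.6, 17.16, 16.07, 17.38,
                17.99, 17.86, 18.54, 10.79, 16.06)
)

# RMSE per machine as a named numeric vector
rmse_by_machine <- sapply(split(fr, fr$machine), function(g) {
  sqrt(mean((g$measured - g$predicted)^2))
})

# One row per machine: label text plus a position in the top-left corner
# (positions are an arbitrary choice; labels are stacked vertically)
y_step <- 0.06 * diff(range(fr$predicted))
rmse_labels <- data.frame(
  machine = names(rmse_by_machine),
  label = sprintf("RMSE (%s) = %.2f", names(rmse_by_machine), rmse_by_machine),
  x = min(fr$measured),
  y = max(fr$predicted) - (seq_along(rmse_by_machine) - 1) * y_step
)

p <- ggplot(fr, aes(measured, predicted, colour = machine)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # The label layer uses its own data; colour = machine is inherited,
  # so each label matches the colour of its group
  geom_text(data = rmse_labels, aes(x = x, y = y, label = label),
            hjust = 0, show.legend = FALSE)
```

Printing p draws the plot. Because rmse_labels has a machine column, each label picks up the same colour as its points and regression line.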
CodePudding user response:
Here is a way using by. The idea of using by must be credited to shadowtalker's answer.
measured <- c(4.37, 3.36, 2.5, 2.96, 4.31, 3.69)
predicted <- c(4.47, 3.07, 3, 3.95, 3.3, 3.59)
machine <- c(rep("A", 3), rep("B", 3))
fr <- data.frame(machine, measured, predicted)
rmse <- function(data) {
  x <- data[[1]]
  y <- data[[2]]
  sqrt(mean((x - y)^2))
}
rmse <- by(fr[-1], fr$machine, rmse)
data.frame(rmse = unclass(rmse))
#>        rmse
#> A 0.3386739
#> B 0.8185760
Created on 2022-11-17 with reprex v2.0.2