How to calculate RMSE for two groups of a dataset in R

I need help calculating the RMSE for each of two groups in a dataset that looks like this:

structure(list(machine = c("B", "B", "B", "B", "B", "B", "B", 
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"), 
    measured = c(14.47, 15.33, 18.56, 14.89, 17.24, 16.25, 13, 
    20.52, 18.06, 13.09, 16.88, 15.92, 14.47, 18.63, 13.88, 16.32, 
    13.83, 11.67, 13.42), predicted = c(15.83, 16, 17.87, 14.21, 
    17.77, 14.14, 12.01, 19.31, 16.98, 13.19, 15.6, 17.16, 16.07, 
    17.38, 17.99, 17.86, 18.54, 10.79, 16.06)), class = "data.frame", row.names = c(NA, 
-19L))

I want to calculate the RMSE for each machine and, if possible, add it to my scatterplot.

This is what I attempted:

library(ggplot2)
library(ggpmisc)  # provides stat_poly_eq()

fr <- read.csv(file.choose())

ggplot(fr, aes(measured, predicted, colour = machine)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  stat_poly_eq(aes(label = paste(after_stat(eq.label),
                                 after_stat(rr.label), sep = "*\", \"*"))) +
  theme_bw(base_size = 16) +
  theme(axis.line = element_line(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

I couldn’t find a way to automatically calculate the RMSE for my model.

CodePudding user response：

In the future, please heed the comment and post your sample data as text using dput, not a screenshot.

This is a classic "split-apply-combine" problem:

1. Split the data by machine type.
2. Apply the function you want to use, in this case RMSE.
3. Combine the results.
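
The three steps above can be sketched in base R with split(), lapply() and unlist(). This is only a minimal illustration: the small data frame is made up for the example and merely reuses the question's column names (machine, measured, predicted), and the helper name rmse is an assumption.

```r
# Illustrative data with the same column names as the question's data frame
fr <- data.frame(
  machine   = c("A", "A", "A", "B", "B", "B"),
  measured  = c(4.37, 3.36, 2.50, 2.96, 4.31, 3.69),
  predicted = c(4.47, 3.07, 3.00, 3.95, 3.30, 3.59)
)

# Root mean squared error between two numeric vectors
rmse <- function(meas, pred) sqrt(mean((meas - pred)^2))

# 1. Split: one data frame per machine
groups <- split(fr, fr$machine)

# 2. Apply: compute the RMSE within each piece
per_group <- lapply(groups, function(d) rmse(d$measured, d$predicted))

# 3. Combine: collapse the list into a named numeric vector
machine_rmse <- unlist(per_group)
print(round(machine_rmse, 4))
#>      A      B
#> 0.3387 0.8186
```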

There are many, many ways to do this, using various libraries, such as dplyr. If you search for "split apply combine" and the dplyr tag on this site, you will find many examples.
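
As a rough illustration of the dplyr route, assuming the dplyr package is installed and the columns are named as in the question, group_by() does the split and summarise() does the apply and combine. The data frame here is made up for the example.

```r
library(dplyr)

# Illustrative data with the same column names as the question's data frame
fr <- data.frame(
  machine   = c("A", "A", "A", "B", "B", "B"),
  measured  = c(4.37, 3.36, 2.50, 2.96, 4.31, 3.69),
  predicted = c(4.47, 3.07, 3.00, 3.95, 3.30, 3.59)
)

# One row per machine, with the RMSE computed inside each group
machine_rmse <- fr %>%
  group_by(machine) %>%
  summarise(rmse = sqrt(mean((measured - predicted)^2)))

print(machine_rmse)
```

The result is a tibble with columns machine and rmse, which can be passed directly to other dplyr or ggplot2 functions.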

Here is one solution using base R, using a somewhat-fancy technique from the discipline of "functional programming":

rmse <- function(meas, pred) {
  sqrt(mean((meas - pred)^2))
}

apply_columns <- function(func, use_names = FALSE) {
  function(data) {
    if (!use_names) {
      data <- unname(data)
    }
    do.call(func, data)
  }
}

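# Pass only the two numeric columns, so that rmse() receives exactly two arguments
machine_rmse <- by(fr[c("measured", "predicted")], fr$machine,
                   apply_columns(rmse), simplify = FALSE)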

The result, machine_rmse, is a list-like by object, which you can convert to a data frame and then use in your ggplot code:

machine_rmse <- data.frame(
  name = names(machine_rmse),
  val = unlist(as.list(machine_rmse)))

However, it's unclear from your question exactly how you want to display this information in your plot, so I can't advise further on how to incorporate it into your ggplot code.
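
That said, one possible sketch, assuming the goal is simply a text label per machine in the top-left corner of the panel, is to build a small label data frame and pass it to geom_text(). The data, the names rmse_df, label and vjust, and the label placement are all assumptions made for illustration. Note that this RMSE compares the measured and predicted columns directly; it is not the residual error of the fitted lm line.

```r
library(ggplot2)

# Illustrative data with the same column names as the question's data frame
fr <- data.frame(
  machine   = c("A", "A", "A", "B", "B", "B"),
  measured  = c(4.37, 3.36, 2.50, 2.96, 4.31, 3.69),
  predicted = c(4.47, 3.07, 3.00, 3.95, 3.30, 3.59)
)

# Per-machine RMSE, one row per machine
parts <- split(fr, fr$machine)
rmse_df <- data.frame(
  machine = names(parts),
  rmse = sapply(parts, function(d) sqrt(mean((d$measured - d$predicted)^2)))
)

# Text to draw, plus a vertical offset so the labels do not overlap
rmse_df$label <- sprintf("%s: RMSE = %.2f", rmse_df$machine, rmse_df$rmse)
rmse_df$vjust <- seq(1.5, by = 1.5, length.out = nrow(rmse_df))

p <- ggplot(fr, aes(measured, predicted, colour = machine)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # x = -Inf, y = Inf pins the labels to the top-left corner of the panel
  geom_text(
    data = rmse_df,
    aes(x = -Inf, y = Inf, label = label, vjust = vjust),
    hjust = -0.1,
    show.legend = FALSE
  )

print(p)
```

Because rmse_df has a machine column, the labels inherit the same colour mapping as the points.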

CodePudding user response：

Here is a way with by. The idea of using by is credited to shadowtalker's answer.

measured <- c(4.37, 3.36, 2.5, 2.96, 4.31, 3.69)
predicted <- c(4.47, 3.07, 3, 3.95, 3.3, 3.59)
machine <- c(rep("A", 3), rep("B", 3))
fr <- data.frame(machine, measured, predicted)

rmse <- function(data) {
  x <- data[[1]]
  y <- data[[2]]
  sqrt(mean((x - y)^2))
}

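# Store the result under a new name so the rmse() function is not overwritten
rmse_by_machine <- by(fr[-1], fr$machine, rmse)
data.frame(rmse = unclass(rmse_by_machine))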
#>        rmse
#> A 0.3386739
#> B 0.8185760

Created on 2022-11-17 with reprex v2.0.2