I would like to mutate new columns with residuals by solving an easy equation of linear regression.
Let's say that I have the table below which contains more columns with the same structure. Each linear function has the same dependent variable.
Dependent_VAR | X_Independent | X_Coefficient | X_Intercept | Y_Intependent | Y_Coefficient | Y_Intercept |
---|---|---|---|---|---|---|
1200 | 2000 | 3000 | 4500 | 500 | 750 | 600 |
3100 | 3000 | 3000 | 4500 | 670 | 750 | 600 |
52222 | 55500 | 3000 | 4500 | 700 | 750 | 600 |
400 | 456 | 3000 | 4500 | 800 | 750 | 600 |
I want to find the residual for each column name of the dataframe by creating new columns with the new name name_Residual like X_Residual = X_Independent-(X_Intercept X_Coefficient*Dependent_VAR). I would prefer a solution using dplyr.
CodePudding user response:
Another option would be to first reshape your data to tidy format using tidyr::pivot_longer
. Afterwards computing the residuals is pretty straightforward:
library(dplyr)
library(tidyr)
dat |>
pivot_longer(-Dependent_VAR, names_to = c("name", ".value"), names_sep = "_") |>
mutate(Residual = Independent - (Intercept Dependent_VAR * Coefficient))
#> # A tibble: 8 × 6
#> Dependent_VAR name Independent Coefficient Intercept Residual
#> <int> <chr> <int> <int> <int> <int>
#> 1 1200 X 2000 3000 4500 -3602500
#> 2 1200 Y 500 750 600 -900100
#> 3 3100 X 3000 3000 4500 -9301500
#> 4 3100 Y 670 750 600 -2324930
#> 5 52222 X 55500 3000 4500 -156615000
#> 6 52222 Y 700 750 600 -39166400
#> 7 400 X 456 3000 4500 -1204044
#> 8 400 Y 800 750 600 -299800
And if necessary you could of course reshape back to wide afterwards:
dat |>
pivot_longer(-Dependent_VAR, names_to = c("name", ".value"), names_sep = "_") |>
mutate(Residual = Independent - (Intercept Dependent_VAR * Coefficient)) |>
pivot_wider(names_from = name, values_from = !c(Dependent_VAR, name))
#> # A tibble: 4 × 9
#> Dependent_VAR Indepe…¹ Indep…² Coeff…³ Coeff…⁴ Inter…⁵ Inter…⁶ Resid…⁷ Resid…⁸
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1200 2000 500 3000 750 4500 600 -3.60e6 -9.00e5
#> 2 3100 3000 670 3000 750 4500 600 -9.30e6 -2.32e6
#> 3 52222 55500 700 3000 750 4500 600 -1.57e8 -3.92e7
#> 4 400 456 800 3000 750 4500 600 -1.20e6 -3.00e5
#> # … with abbreviated variable names ¹Independent_X, ²Independent_Y,
#> # ³Coefficient_X, ⁴Coefficient_Y, ⁵Intercept_X, ⁶Intercept_Y, ⁷Residual_X,
#> # ⁸Residual_Y
DATA
dat <- data.frame(
Dependent_VAR = c(1200L, 3100L, 52222L, 400L),
X_Independent = c(2000L, 3000L, 55500L, 456L),
X_Coefficient = c(3000L, 3000L, 3000L, 3000L),
X_Intercept = c(4500L, 4500L, 4500L, 4500L),
Y_Independent = c(500L, 670L, 700L, 800L),
Y_Coefficient = c(750L, 750L, 750L, 750L),
Y_Intercept = c(600L, 600L, 600L, 600L)
)
CodePudding user response:
Here is a possible approach, it uses !!sym(var)
to use the text stored in var
as a variable name. There are several other ways to do the same thing, see the [Programming with dplyr vignette][1].
prefix_list = c("X_","Y_")
for(prefix in prefix_list){
this_independent = paste0(prefix,"Independent")
this_coefficient = paste0(prefix,"Coefficient")
this_intercept = paste0(prefix,"Intercept")
this_interdependent = paste0(prefix,"Interpendent")
this_residual = paste0(prefix,"Residual")
df = df %>%
mutate(!!sym(this_residual) := !!sym(this_independent) - !!sym(this_intercept) - !!sym(this_coefficient) * !!sym(this_interdepdendent))
}
[1]: https://dplyr.tidyverse.org/articles/programming.html