I am trying to identify column names with matching substrings, and then calculate the differences of the values in those columns.
Sample data:
V1_ABC <- c(1,2,3,4)
V2_ABC <- c(2,3,4,5)
V1_WXYZ <- c(10,11,12,13)
V2_WXYZ <- c(11,12,13,14)
Date <- c(2001,2002,2003,2004)
So df looks like:
df <- data.frame(Date, V1_ABC, V2_ABC, V1_WXYZ, V2_WXYZ)
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ
1 2001 1 2 10 11
2 2002 2 3 11 12
3 2003 3 4 12 13
4 2004 4 5 13 14
I want to calculate V1 minus V2 for ABC and WXYZ. My original dataset is much larger, so I don't want to do this manually for each. I'd like to automate this so that R compares the column headers and finds which columns have the same ending substring (V1_ABC and V2_ABC, and V1_WXYZ and V2_WXYZ), then subtracts the V2_ from the V1_. Like this:
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
1 2001 1 2 10 11 -1 -1
2 2002 2 3 11 12 -1 -1
3 2003 3 4 12 13 -1 -1
4 2004 4 5 13 14 -1 -1
Most of the functions I have found such as grep or intersect either look for a specific string you input, or return the values where the vectors are the same.
Any ideas on how to automate pairing based on names/substrings?
CodePudding user response:
You could stack V1
and V2
separately, calculate the differences, and reshape them back to the wide form. This approach can deal with any numbers of pairs of V1_xxx
and V2_xxx
.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-Date, names_to = c(".value", "grp"), names_sep = "_") %>%
mutate(dif = V1 - V2) %>%
pivot_wider(names_from = grp, values_from = c(V1, V2, dif))
# # A tibble: 4 × 7
# Date V1_ABC V1_WXYZ V2_ABC V2_WXYZ dif_ABC dif_WXYZ
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2001 1 10 2 11 -1 -1
# 2 2002 2 11 3 12 -1 -1
# 3 2003 3 12 4 13 -1 -1
# 4 2004 4 13 5 14 -1 -1
CodePudding user response:
Here is a base R solution. You mention that your data frame is large so this checks for columns where there are exactly 2 shared suffixes and only operates on those. It assumes that they are all of the format "V1_suffix"
and "V2_suffix"
but could be easily modified if they are in other formats.
suffixes <- unlist(regmatches(names(df), gregexpr("_. ", names(df))))
# Limit to suffixes where there are 2
suffixes <- names(table(suffixes)[table(suffixes) == 2])
diffs <- sapply(suffixes,
\(suffix) df[[paste0("V1", suffix)]] - df[[paste0("V2", suffix)]]
)
diff_df <- data.frame(diffs) |>
setNames(paste0("dif", suffixes))
cbind(df, diff_df)
# V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
# 1 1 2 10 11 -1 -1
# 2 2 3 11 12 -1 -1
# 3 3 4 12 13 -1 -1
# 4 4 5 13 14 -1 -1