Calculate values based on matched substrings within names-CodePudding

I am trying to identify column names with matching substrings, and then calculate the differences of the values in those columns.

Sample data:

V1_ABC <- c(1,2,3,4)
V2_ABC <- c(2,3,4,5)
V1_WXYZ <- c(10,11,12,13)
V2_WXYZ <- c(11,12,13,14)
Date <- c(2001,2002,2003,2004)

So df looks like:

df <- data.frame(Date, V1_ABC, V2_ABC, V1_WXYZ, V2_WXYZ)

  Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ
1 2001      1      2     10     11
2 2002      2      3     11     12
3 2003      3      4     12     13
4 2004      4      5     13     14

I want to calculate V1 minus V2 for ABC and WXYZ. My original dataset is much larger, so I don't want to do this manually for each. I'd like to automate this so that R compares the column headers and finds which columns have the same ending substring (V1_ABC and V2_ABC, and V1_WXYZ and V2_WXYZ), then subtracts the V2_ from the V1_. Like this:

  Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
1 2001      1      2      10      11      -1       -1
2 2002      2      3      11      12      -1       -1
3 2003      3      4      12      13      -1       -1
4 2004      4      5      13      14      -1       -1

Most of the functions I have found such as grep or intersect either look for a specific string you input, or return the values where the vectors are the same.

Any ideas on how to automate pairing based on names/substrings?

CodePudding user response：

You could stack V1 and V2 separately, calculate the differences, and reshape them back to the wide form. This approach can deal with any numbers of pairs of V1_xxx and V2_xxx.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-Date, names_to = c(".value", "grp"), names_sep = "_") %>%
  mutate(dif = V1 - V2) %>%
  pivot_wider(names_from = grp, values_from = c(V1, V2, dif))

# # A tibble: 4 × 7
#    Date V1_ABC V1_WXYZ V2_ABC V2_WXYZ dif_ABC dif_WXYZ
#   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
# 1  2001      1      10      2      11      -1       -1
# 2  2002      2      11      3      12      -1       -1
# 3  2003      3      12      4      13      -1       -1
# 4  2004      4      13      5      14      -1       -1

CodePudding user response：

Here is a base R solution. You mention that your data frame is large so this checks for columns where there are exactly 2 shared suffixes and only operates on those. It assumes that they are all of the format "V1_suffix" and "V2_suffix" but could be easily modified if they are in other formats.

suffixes <- unlist(regmatches(names(df), gregexpr("_. ", names(df))))

# Limit to suffixes where there are 2
suffixes <- names(table(suffixes)[table(suffixes) == 2])

diffs  <- sapply(suffixes, 
    \(suffix) df[[paste0("V1", suffix)]] - df[[paste0("V2", suffix)]]
)


diff_df  <- data.frame(diffs)  |> 
    setNames(paste0("dif", suffixes))

cbind(df, diff_df)

#   V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
# 1      1      2      10      11      -1       -1
# 2      2      3      11      12      -1       -1
# 3      3      4      12      13      -1       -1
# 4      4      5      13      14      -1       -1