Home > Net >  Taking rolling differences of columns in R tibble for arbitrary number of columns
Taking rolling differences of columns in R tibble for arbitrary number of columns

Time:12-04

I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...

df <- as.tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
                       subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
                       iter1 = 1:12,
                       iter2 = c(13:22, NA, 24),
                       iter3 = c(25:35, NA)))

I want to calculate the differences by column. I would normally use...

df %>%
  mutate(diff_iter2 = iter2 - iter1,
         diff_iter3 = iter3 - iter2)

But... I'd like to:

  1. accomodate an arbitrary number of columns and
  2. treat NAs such that:
    if the number we're subtracting from is NA, then the result should be NA. E.g. NA - 11 = NA
    if the number we're subtracting is NA, then that NA is effectively treated as a 0. E.g. 35 - NA = 35

The result should look like this...

   group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
   <chr> <chr>    <int> <dbl> <int>      <dbl>      <dbl>
 1 a     adam         1    13    25         12         12
 2 a     boy          2    14    26         12         12
 3 a     charles      3    15    27         12         12
 4 a     david        4    16    28         12         12
 5 b     adam         5    17    29         12         12
 6 b     boy          6    18    30         12         12
 7 b     charles      7    19    31         12         12
 8 b     david        8    20    32         12         12
 9 c     adam         9    21    33         12         12
10 c     boy         10    22    34         12         12
11 c     charles     11    NA    35         NA         35
12 c     david       12    24    NA         12         NA

Originally, this df was in long format but the problem was that I believe the lag() function operates on position within groups and all the groups aren't the same because some have missing records (hence the NA in the wider table shown above).

Starting with long format would do but then please assume the records shown above with NA values would not exist in that longer dataframe.

Any help is appreciated.

CodePudding user response:

An option in tidyverse would be - loop across the columns of 'iter' other than the iter1, then get the column value by replacing the column name (cur_column()) substring by subtracting 1 (as.numeric(x) -1) with str_replace, then replace the NA elements with 0 (replace_na) based on the OP's logic, subtract from the looped column and create new columns by adding prefix in .names ("diff_{.col}" - {.col} will be the original column name)

library(dplyr)
library(stringr)
library(tidyr)
df <- df %>% 
    mutate(across(iter2:iter3, ~
     . - replace_na(get(str_replace(cur_column(), '\\d ', 
     function(x) as.numeric(x) - 1)), 0), .names = 'diff_{.col}'))

-output

df
# A tibble: 12 × 7
   group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
   <chr> <chr>    <int> <dbl> <int>      <dbl>      <dbl>
 1 a     adam         1    13    25         12         12
 2 a     boy          2    14    26         12         12
 3 a     charles      3    15    27         12         12
 4 a     david        4    16    28         12         12
 5 b     adam         5    17    29         12         12
 6 b     boy          6    18    30         12         12
 7 b     charles      7    19    31         12         12
 8 b     david        8    20    32         12         12
 9 c     adam         9    21    33         12         12
10 c     boy         10    22    34         12         12
11 c     charles     11    NA    35         NA         35
12 c     david       12    24    NA         12         NA

CodePudding user response:

Find the columns whose names start with iter, ix, and then take all but the first as df1, all but the last as df2 and replace the NAs in df2 with 0. Then subtract them and cbind df to that. No packages are used.

ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)

giving:

   group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1      a     adam     1    13    25         12         12
2      a      boy     2    14    26         12         12
3      a  charles     3    15    27         12         12
4      a    david     4    16    28         12         12
5      b     adam     5    17    29         12         12
6      b      boy     6    18    30         12         12
7      b  charles     7    19    31         12         12
8      b    david     8    20    32         12         12
9      c     adam     9    21    33         12         12
10     c      boy    10    22    34         12         12
11     c  charles    11    NA    35         NA         35
12     c    david    12    24    NA         12         NA
  • Related