R sum row values based on column name-CodePudding

I have a dataset with over 10,000 columns and 10,000 rows. I am trying to add values of rows based on column names.

The dataset looks something like this

data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
              x1 = c(1, 11, 111),
              x2 = c(2, 22, 222),
              x1_1 = c(3, 333, 333),
              x2_1 = c(4, 44, 44),
              x1_2 = c(5, 55, 555),
              x2_2 = c(6, 66, 666),)

I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1 3 5, value of x2 for row 2 = 22 44 66, etc.

Any help would be much appreciated.

CodePudding user response：

Here's a for loop approach:

library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
  result[[name_stems[i]]] = rowSums(data[str_detect(names(data), name_stems[i])])
}

result
# # A tibble: 3 × 3
#   date        x1    x2
#   <chr>    <dbl> <dbl>
# 1 1/1/2018     9    12
# 2 2/1/2018   399   132
# 3 3/1/2018   999   932

CodePudding user response：

Using data.table:

baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
##        date  x1  x2
## 1: 1/1/2018   9  12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932

CodePudding user response：

Your data is in the wide format. One way of achieving your goal is transforming the data into the long format, then grouping them based on indices (x1 and x2), compute the sums for each group for each date, and finally transform the results back to the wide formats to create columns based on the indices.

library(tidyverse)

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = substr(name, 1,2)) |> 
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )

#  date        x1    x2
#  <chr>    <dbl> <dbl>
#1 1/1/2018     9    12
#2 2/1/2018   399   132
#3 3/1/2018   999   932

Updates

In order to include only columns x1 and x1_, and exclude any other column even though it starts with x1, the following regular expression pattern can be used : "x1$|(x1_).*". The similar pattern can be used to include only columns x2 and x2_. For example:

s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10"    "x1"     "x1_1"   "x1_2"   "x2"     "x2_1"   "x2_2"   "x20"   
#[10] "x20_1" 

s |> str_extract("x1$|(x1_).*")
#[1] NA     NA     "x1"   "x1_1" "x1_2" NA     NA     NA     NA     NA

s |> str_extract("x2$|(x2_).*")
#[1] NA     NA     NA     NA     NA     "x2"   "x2_1" "x2_2" NA     NA

This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.

Here is the full code:

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
                              str_detect(name, "x2$|(x2_).*")~"x2")) |>
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )