I have a dataset with over 10,000 columns and 10,000 rows. I am trying to add values of rows based on column names.
The dataset looks something like this
data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
x1 = c(1, 11, 111),
x2 = c(2, 22, 222),
x1_1 = c(3, 333, 333),
x2_1 = c(4, 44, 44),
x1_2 = c(5, 55, 555),
x2_2 = c(6, 66, 666),)
I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1 3 5, value of x2 for row 2 = 22 44 66, etc.
Any help would be much appreciated.
CodePudding user response:
Here's a for loop approach:
library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
result[[name_stems[i]]] = rowSums(data[str_detect(names(data), name_stems[i])])
}
result
# # A tibble: 3 × 3
# date x1 x2
# <chr> <dbl> <dbl>
# 1 1/1/2018 9 12
# 2 2/1/2018 399 132
# 3 3/1/2018 999 932
CodePudding user response:
Using data.table
:
baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
## date x1 x2
## 1: 1/1/2018 9 12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932
CodePudding user response:
Your data is in the wide format. One way of achieving your goal is transforming the data into the long format, then grouping them based on indices (x1 and x2), compute the sums for each group for each date, and finally transform the results back to the wide formats to create columns based on the indices.
library(tidyverse)
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = substr(name, 1,2)) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )
# date x1 x2
# <chr> <dbl> <dbl>
#1 1/1/2018 9 12
#2 2/1/2018 399 132
#3 3/1/2018 999 932
Updates
In order to include only columns x1
and x1_
, and exclude any other column even though it starts with x1
, the following regular expression pattern can be used : "x1$|(x1_).*"
. The similar pattern can be used to include only columns x2
and x2_
. For example:
s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10" "x1" "x1_1" "x1_2" "x2" "x2_1" "x2_2" "x20"
#[10] "x20_1"
s |> str_extract("x1$|(x1_).*")
#[1] NA NA "x1" "x1_1" "x1_2" NA NA NA NA NA
s |> str_extract("x2$|(x2_).*")
#[1] NA NA NA NA NA "x2" "x2_1" "x2_2" NA NA
This pattern can then be used to create a group that consists of x1
and x1_
columns only and another group that consists of x2
and x2_
columns only.
Here is the full code:
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
str_detect(name, "x2$|(x2_).*")~"x2")) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )