dplyr- cumulative sum of a column, based on comparisons with other columns-CodePudding

I'm reading in some XY data from a file, and need to do a cumulative sum on the Y ordinate with steps coarser than that in the original data.

How can I do a cumsum on a the Y column based on X values in other columns?

library(tidyverse)
x_data <- seq(2, 5, 0.2) # this is read in from a file
y_data <- runif(length(x_data)) # this is read in from a file
input_data <- tibble(x_data, y_data)

x_steps <- seq(2, 5, 1)
final_data = tibble(x_steps)

#how to cumsum input_data$y_data based on final_data$x_steps?
#x_data and x_steps are guarenteed to be strictly increasing

*edit: Some example data:

# input_data
2.0   0.55005588
2.2   0.27577714
2.4   0.24215415
2.6   0.70794255
2.8   0.30632029
3.0   0.71693474
3.2   0.49712663
3.4   0.21227011
3.6   0.74823731
3.8   0.90176957
4.0   0.52331898
4.2   0.79409880
4.4   0.04228391
4.6   0.06220941
4.8   0.50969235
5.0   0.11734668

should turn into:

#final_data
2.0   0.55005588
3.0   2.79918475
4.0   5.68190735
5.0   7.2075385

CodePudding user response：

Alternative solution:

input_data %>%
  mutate(id = as.integer(cut(x_data, breaks = 0:max(x_data)))) %>%
  group_by(id) %>%
  summarize(sum = sum(y_data)) %>%
  mutate(cumsum_ydata = cumsum(sum))

CodePudding user response：

We can try

library(tidyverse)

input_data |>
           mutate(x_steps = rev(trunc(x_data))) |>
           group_by(x_steps) |> 
           summarise(cum_y_data = sum(y_data)) |>
           mutate(cum_y_data = rev(cum_y_data)) |> 
           mutate(cum_y_data = cumsum(cum_y_data))

output

# A tibble: 4 × 2
  x_steps cum_y_data
    <dbl>      <dbl>
1       2      0.550
2       3      2.80 
3       4      5.68 
4       5      7.21

CodePudding user response：

Though not the most elegant solution, if you wanted/neded a brute-force base R approach:

want <- matrix(NA, ncol = 2, nrow = length(x_steps))
for(xx in seq_along(x_steps)){
  res <- seq(x_steps[xx], x_steps[xx]   1, 0.0001)
  want[xx,] <- c(x_steps[xx], 
                 max(cumsum(input_data[input_data$x_data %in% res[-length(res)], "y_data"])))
}

Output:

#>want
#[,1]     [,2]
#[1,]    2 3.308344
#[2,]    3 1.581712
#[3,]    4 2.660372
#[4,]    5 0.899825

If the steps are truly lower resolution, an easier dplyr() approach may be to use floor() then perform some basic functions:


input_data %>% 
  group_by(floor(x_data)) %>% 
  mutate(cum_sum = cumsum(y_data)) %>%
  filter(row_number() == n()) %>%
  select(cum_sum)

Output:

  `floor(x_data)` cum_sum
            <dbl>   <dbl>
1               2   3.31 
2               3   2.47 
3               4   2.76 
4               5   0.900

CodePudding user response：

Another tidyverse approach: Last 2 lines same as @deschen:

Main challenge is to group for summing. Here we use a combination of coalesce, str_extract and parse_number to group:

library(tidyverse)

input_data %>% 
  group_by(id = coalesce(parse_number(str_extract(x_data, '\\d \\.')) 1, x_data)) %>% 
  summarise(sum = sum(y_data)) %>% 
  mutate(cumsum_ydata = cumsum(sum))

      id    sum cumsum_ydata
   <dbl>  <dbl>        <dbl>
 1     2  0.874        0.874
 2     3 49.2         50.1  
 3     4 48.7         98.8  
 4     5 45.0        144.   
 5     6 48.2        192.   
 6     7 49.1        241.   
 7     8 47.4        289.   
 8     9 46.1        335.   
 9    10 50.8        385.   
10    11 51.1        437.   
# ... with 154 more rows

CodePudding user response：

This is what I came up with:

x_steps <- seq(2, 5, 1)
final_data <-  tibble(x_steps)
final_data <- final_data  %>% mutate(id=cut(x_steps, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1)))

x_data <- seq(2, 5, 0.2) # this is really read in from a file
y_data <- runif(length(x_data)) # this is really read in from a file
input_data <- tibble(x_data, y_data)

final_data <- input_data %>% 
  mutate(id=cut(x_data, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1))) %>% 
  group_by(id) %>% 
  summarise(totals=sum(y_data)) %>% 
  mutate(csum = cumsum(totals)) %>% 
  select(c("id", "csum")) %>% 
  right_join(final_data, by="id") %>% 
  relocate(x_steps)

I stuck with the explicit ID column, as I wanted to join it back to the final_data tibble, along with other calculations.

Part my own deductions, part taking from @deschen