I'm reading in some XY data from a file, and need to do a cumulative sum on the Y ordinate with steps coarser than that in the original data.
How can I do a cumsum on a the Y column based on X values in other columns?
library(tidyverse)
x_data <- seq(2, 5, 0.2) # this is read in from a file
y_data <- runif(length(x_data)) # this is read in from a file
input_data <- tibble(x_data, y_data)
x_steps <- seq(2, 5, 1)
final_data = tibble(x_steps)
#how to cumsum input_data$y_data based on final_data$x_steps?
#x_data and x_steps are guarenteed to be strictly increasing
*edit: Some example data:
# input_data
2.0 0.55005588
2.2 0.27577714
2.4 0.24215415
2.6 0.70794255
2.8 0.30632029
3.0 0.71693474
3.2 0.49712663
3.4 0.21227011
3.6 0.74823731
3.8 0.90176957
4.0 0.52331898
4.2 0.79409880
4.4 0.04228391
4.6 0.06220941
4.8 0.50969235
5.0 0.11734668
should turn into:
#final_data
2.0 0.55005588
3.0 2.79918475
4.0 5.68190735
5.0 7.2075385
CodePudding user response:
Alternative solution:
input_data %>%
mutate(id = as.integer(cut(x_data, breaks = 0:max(x_data)))) %>%
group_by(id) %>%
summarize(sum = sum(y_data)) %>%
mutate(cumsum_ydata = cumsum(sum))
CodePudding user response:
We can try
library(tidyverse)
input_data |>
mutate(x_steps = rev(trunc(x_data))) |>
group_by(x_steps) |>
summarise(cum_y_data = sum(y_data)) |>
mutate(cum_y_data = rev(cum_y_data)) |>
mutate(cum_y_data = cumsum(cum_y_data))
- output
# A tibble: 4 × 2
x_steps cum_y_data
<dbl> <dbl>
1 2 0.550
2 3 2.80
3 4 5.68
4 5 7.21
CodePudding user response:
Though not the most elegant solution, if you wanted/neded a brute-force base R approach:
want <- matrix(NA, ncol = 2, nrow = length(x_steps))
for(xx in seq_along(x_steps)){
res <- seq(x_steps[xx], x_steps[xx] 1, 0.0001)
want[xx,] <- c(x_steps[xx],
max(cumsum(input_data[input_data$x_data %in% res[-length(res)], "y_data"])))
}
Output:
#>want
#[,1] [,2]
#[1,] 2 3.308344
#[2,] 3 1.581712
#[3,] 4 2.660372
#[4,] 5 0.899825
If the steps are truly lower resolution, an easier dplyr()
approach may be to use floor()
then perform some basic functions:
input_data %>%
group_by(floor(x_data)) %>%
mutate(cum_sum = cumsum(y_data)) %>%
filter(row_number() == n()) %>%
select(cum_sum)
Output:
`floor(x_data)` cum_sum
<dbl> <dbl>
1 2 3.31
2 3 2.47
3 4 2.76
4 5 0.900
CodePudding user response:
Another tidyverse
approach: Last 2 lines same as @deschen:
Main challenge is to group for summing. Here we use a combination of coalesce
, str_extract
and parse_number
to group:
library(tidyverse)
input_data %>%
group_by(id = coalesce(parse_number(str_extract(x_data, '\\d \\.')) 1, x_data)) %>%
summarise(sum = sum(y_data)) %>%
mutate(cumsum_ydata = cumsum(sum))
id sum cumsum_ydata
<dbl> <dbl> <dbl>
1 2 0.874 0.874
2 3 49.2 50.1
3 4 48.7 98.8
4 5 45.0 144.
5 6 48.2 192.
6 7 49.1 241.
7 8 47.4 289.
8 9 46.1 335.
9 10 50.8 385.
10 11 51.1 437.
# ... with 154 more rows
CodePudding user response:
This is what I came up with:
x_steps <- seq(2, 5, 1)
final_data <- tibble(x_steps)
final_data <- final_data %>% mutate(id=cut(x_steps, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1)))
x_data <- seq(2, 5, 0.2) # this is really read in from a file
y_data <- runif(length(x_data)) # this is really read in from a file
input_data <- tibble(x_data, y_data)
final_data <- input_data %>%
mutate(id=cut(x_data, breaks=c(-Inf, x_steps), labels=seq(2, 5, 1))) %>%
group_by(id) %>%
summarise(totals=sum(y_data)) %>%
mutate(csum = cumsum(totals)) %>%
select(c("id", "csum")) %>%
right_join(final_data, by="id") %>%
relocate(x_steps)
I stuck with the explicit ID column, as I wanted to join it back to the final_data
tibble, along with other calculations.
Part my own deductions, part taking from @deschen