Grouping and Counting by Dates (R)-CodePudding

I am working with the R programming language. I have a data frame that looks like this:

  startdate <- c('2010-01-01','2010-01-01','2010-01-01', '2010-01-02','2010-01-03','2010-01-03')

event <- c(1,1,1,1,1,1)
    
 my_data <- data.frame(startdate, event)

   startdate event
1 2010-01-01     1
2 2010-01-01     1
3 2010-01-01     1
4 2010-01-02     1
5 2010-01-03     1
6 2010-01-03     1

Note: The actual value of "startdate" is "POSIXct" and is written as "year-month-date".

I am trying to take a cumulative sum of "event" according to the "startdate" column. The result should look like this

  startdate <- c('2010-01-01', '2010-01-02' ,'2010-01-03')

event <- c(3,4,6)
    
 my_data_2 <- data.frame(startdate, event)

#desired file
   startdate event
1 2010-01-01     3
2 2010-01-02     4
3 2010-01-03     6

I tried to do this with the "dplyr" library:

library(dplyr)

new_file = my_data %>% group_by(startdate) %>% mutate(cumsum_value = cumsum(event))

But this returns something slightly different and non-intended:

 startdate  event cumsum_value
  <chr>      <dbl>        <dbl>
1 2010-01-01     1            1
2 2010-01-01     1            2
3 2010-01-01     1            3
4 2010-01-02     1            1
5 2010-01-03     1            1
6 2010-01-03     1            2

Can someone please show me how to fix this?

Thanks

CodePudding user response：

my_data %>%
  mutate(cumsum = cumsum(event)) %>%
  group_by(startdate) %>%
  summarise(max(cumsum))

# A tibble: 3 × 2
  startdate  `max(cumsum)`
  <chr>              <dbl>
1 2010-01-01             3
2 2010-01-02             4
3 2010-01-03             6

CodePudding user response：

mutate the event column and calculate cumsum
group_by startdate and
summarise max(event)

library(dplyr)
my_data %>%
    mutate(event = cumsum(event)) %>% 
    group_by(startdate) %>% 
    summarise(event = max(event))
```
```
  startdate  event
  <chr>      <dbl>
1 2010-01-01     3
2 2010-01-02     4
3 2010-01-03     6
```

CodePudding user response：

Another option is also to make use of duplicated and thus avoiding the group_by. Also, if the 'event' column is just 1, instead of doing cumsum, we could use the built-in function row_number() to create a sequence

library(dplyr)
my_data %>%
   mutate(event = row_number()) %>% 
   filter(!duplicated(startdate, fromLast = TRUE))