How to count occurrences by day using dplyr

Time:02-11

I am trying to use dplyr to do some basic statistics. The two parts of my data that I’m interested in are the dates and the outcomes of an event.

My data has three possible events: reward, stop, or none (meaning neither a reward nor a stop occurred). What I want to do is find which days have the highest average number of rewards. I would also like to obtain, for each day, the count of overall occurrences and the number of stops, rewards, and nones.

I have had some success obtaining the unique days and the overall occurrences per day. However, I am struggling to get the remaining data. When I try to adjust the group_by, it ends up causing issues trying to find the unique days.

df %>%
  mutate(ind, ind2 = case_when(ind=="Reward"~1, ind=="Stop"~0, ind=="None"~0)) %>%
  group_by(time2) %>%
  count(time2, sort = TRUE) 

Here, I try to create a new column that converts the event to a binary format so that I can then calculate the average reward per day. This code is not required in an answer; it is just an example.

Desired output:

Date        num_occ  stop  reward  none  avg_reward
2022-01-03        3     1       1     1   0.3333333
2022-01-04        9     5       3     1   0.3333333
2022-01-05        2     1       1     0   0.5
2022-01-06        3     3       0     0   0

My question is, how can I calculate the average reward occurrences per day as well as obtain count information regarding the number of overall (reward, stop, none) occurrences per day, number of stops per day, number of rewards per day, and number of nones per day?

Example data:

structure(list(values = c(0, 18, 3, 2, 1, 9, 15, 13, 0, 12, 8, 
2, 3, 7, 6, 3), ind = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 2L, 
2L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 3L), .Label = c("None", "Reward", 
"Stop"), class = "factor"), entry = c(TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE), time = structure(c(1641234180, 1641235020, 1641308400, 
1641312840, 1641312900, 1641316920, 1641322920, 1641325080, 1641325560, 
1641328740, 1641329220, 1641393900, 1641412140, 1641491040, 1641491640, 
1641493200), class = c("POSIXct", "POSIXt"), tzone = ""), time2 = structure(c(18995, 
18995, 18996, 18996, 18996, 18996, 18996, 18996, 18996, 18996, 
18996, 18997, 18997, 18998, 18998, 18998), class = "Date")), row.names = c(NA, 
16L), class = "data.frame")

CodePudding user response:

Something like this may work for you

library(tidyverse)

example_data <- structure(list(values = c(0, 18, 3, 2, 1, 9, 15, 13, 0, 12, 8, 
  2, 3, 7, 6, 3), ind = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 2L, 
  2L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 3L), .Label = c("None", "Reward", 
  "Stop"), class = "factor"), entry = c(TRUE, TRUE, TRUE, TRUE, 
  TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
  TRUE), time = structure(c(1641234180, 1641235020, 1641308400, 
  1641312840, 1641312900, 1641316920, 1641322920, 1641325080, 1641325560, 
  1641328740, 1641329220, 1641393900, 1641412140, 1641491040, 1641491640, 
  1641493200), class = c("POSIXct", "POSIXt"), tzone = ""), time2 = structure(c(18995, 
  18995, 18996, 18996, 18996, 18996, 18996, 18996, 18996, 18996, 
  18996, 18997, 18997, 18998, 18998, 18998), class = "Date")), row.names = c(NA, 
  16L), class = "data.frame")

example_data |> 
  group_by(day = time |> lubridate::as_date()) |> 
  summarise(num_occ = n(),
            stop = length(ind[ind == 'Stop']),
            Reward = length(ind[ind == 'Reward']),
            None = length(ind[ind == 'None']),
            # index values with the logical mask itself; indexing with the
            # factor ind[...] would use its integer codes as positions
            sum_reward = sum(values[ind == 'Reward'])
            )
#> # A tibble: 4 x 6
#>   day        num_occ  stop Reward  None sum_reward
#>   <date>       <int> <int>  <int> <int>      <dbl>
#> 1 2022-01-03       2     0      1     1         18
#> 2 2022-01-04       9     5      3     1         40
#> 3 2022-01-05       2     1      1     0          3
#> 4 2022-01-06       3     3      0     0          0

Created on 2022-02-10 by the reprex package (v2.0.1)
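
If what you need is the share of rewards per day rather than a sum of `values`, one option is to sum or average the logical comparison directly, since `TRUE` counts as 1 and `FALSE` as 0. The following is only a sketch under a few assumptions: it groups on the question's existing `time2` column, the name `daily` is made up for illustration, and the data frame is a compact rebuild of the question's data (just `ind` and `time2`) so that the block runs on its own.

```r
library(dplyr)

# Compact rebuild of the question's data (only the columns used here)
example_data <- data.frame(
  ind = factor(c("None", "Reward", "Stop", "Stop", "Stop", "Stop", "Reward",
                 "Reward", "None", "Reward", "Stop", "Stop", "Reward", "Stop",
                 "Stop", "Stop")),
  time2 = as.Date(c(18995, 18995, rep(18996, 9), 18997, 18997, rep(18998, 3)),
                  origin = "1970-01-01")
)

daily <- example_data %>%
  group_by(Date = time2) %>%
  summarise(
    num_occ    = n(),                    # all events on that day
    stop       = sum(ind == "Stop"),     # TRUE counts as 1
    reward     = sum(ind == "Reward"),
    none       = sum(ind == "None"),
    avg_reward = mean(ind == "Reward"),  # share of events that are rewards
    .groups = "drop"
  ) %>%
  arrange(desc(avg_reward))              # highest reward share first

daily
```

Sorting by `avg_reward` puts the days with the highest reward share at the top, which seems to be what the question is after.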

CodePudding user response:

Using @Bruno's sample data (thanks!)

## count totals per day
d1 <- (example_data
  %>% count(time2)
)
## count number of each event type per day; convert to wide format
d2 <- (example_data
  %>% count(time2, ind)
  %>% pivot_wider(names_from = "ind", values_from = n)
  %>% replace_na(list(None = 0, Reward = 0, Stop = 0))
)
## combine and compute averages
(full_join(d1, d2, by = "time2")
  %>% mutate(avg_reward = Reward/n)
)
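
The same idea can probably be written as a single pipeline by letting `pivot_wider()` fill the missing combinations itself. This is a sketch, not a tested replacement for the code above: it assumes tidyr 1.1.0 or later (where `values_fill` accepts a scalar), that every event type occurs at least once in the data so the `None`, `Reward` and `Stop` columns all exist, and it uses a compact rebuild of the sample data so it is self-contained. The name `wide` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Compact rebuild of the sample data (only the columns used here)
example_data <- data.frame(
  ind = factor(c("None", "Reward", "Stop", "Stop", "Stop", "Stop", "Reward",
                 "Reward", "None", "Reward", "Stop", "Stop", "Reward", "Stop",
                 "Stop", "Stop")),
  time2 = as.Date(c(18995, 18995, rep(18996, 9), 18997, 18997, rep(18998, 3)),
                  origin = "1970-01-01")
)

wide <- example_data %>%
  count(time2, ind) %>%                      # one row per day and event type
  pivot_wider(names_from = ind, values_from = n,
              values_fill = 0) %>%           # absent combinations become 0
  mutate(num_occ    = None + Reward + Stop,  # total events per day
         avg_reward = Reward / num_occ)

wide
```

This avoids the separate `full_join()`, at the cost of hard-coding the three event names in the `mutate()` step.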