Home > Software design >  Frequency of a variable by month
Frequency of a variable by month

Time:12-17

I am looking for colour frequency by month. I would like to make a line graph with each colour's percentage for each month. This is my data:


ID    color1   color2  color3  date
55    red     blue     NA     2020-03-15
67    yellow  NA       NA     2020-05-02
83    blue    yellow   NA     2020-05-17
78    red     yellow   blue   2020-05-15  
43    green   NA       NA     2021-01-27
29    yellow  green    NA     2021-01-03

I need something like this to graph. I need the month's article count as the denominator. So if the ID has multiple colours (for example, all IDs in 03/2020 are blue and red), the total percentage can be higher than 100.


Month     n  freq_blue freq_red  freq_yellow  freq_green %_blue %_red   _yellow %_green
03-2020   1    1        1          0           0            100  100     0       0
04-2020   0    0        0          0           0            0     0      0       0
05-2020   3    2        1          3           0            66.7  33.3   100     0
06-2020   0    0        0          0           0            0     0      0       0
07-2020   0    0        0          0           0            0     0      0       0
08-2020   0    0        0          0           0            0     0      0       0
09-2020   0    0        0          0           0            0     0      0       0
10-2020   0    0        0          0           0            0     0      0       0
11-2020   0    0        0          0           0            0     0      0       0
12-2020   0    0        0          0           0            0     0      0       0
01-2021   2    0        0          1           2            0     0      50     100

CodePudding user response:

As suggested, it is helpful if you could edit your original post/question with your code instead of in the comments below.

Based on what you have (and your prior question) this may be helpful.

Consider creating a new column month_total that you can use for percentage calculation - it appears you want to know the number of IDs for a given month (it was unclear if a color could appear more than once in a row).

After you determine the frequencies and percentages, and use complete to fill in missing months and colors, you can use fill to include the monthly totals as well.

library(tidyverse)
library(lubridate)

df$date <- as.Date(df$date)

df %>%
  mutate(month = month(date), year = year(date)) %>%
  pivot_longer(cols = starts_with("color")) %>%
  filter(!is.na(value)) %>%
  group_by(month, year) %>%
  mutate(month_total = n_distinct(ID)) %>%
  group_by(value, month_total, .add = T) %>%
  summarise(freq = n(), percent = freq/month_total[1] * 100) %>%
  ungroup() %>%
  complete(year, month = 1:12, value = c("blue", "red", "yellow", "green"), fill = list(freq = 0, percent = 0)) %>%
  group_by(year, month) %>%
  fill(month_total, .direction = "updown") %>%
  pivot_wider(id_cols = c(month, year, month_total), names_from = value, values_from = c(freq, percent)) %>%
  replace_na(list(month_total = 0))

Output

   month  year month_total freq_blue freq_green freq_red freq_yellow percent_blue percent_green percent_red percent_yellow
   <dbl> <dbl>       <dbl>     <dbl>      <dbl>    <dbl>       <dbl>        <dbl>         <dbl>       <dbl>          <dbl>
 1     1  2020           0         0          0        0           0          0               0         0                0
 2     2  2020           0         0          0        0           0          0               0         0                0
 3     3  2020           1         1          0        1           0        100               0       100                0
 4     4  2020           0         0          0        0           0          0               0         0                0
 5     5  2020           3         2          0        1           3         66.7             0        33.3            100
 6     6  2020           0         0          0        0           0          0               0         0                0
 7     7  2020           0         0          0        0           0          0               0         0                0
 8     8  2020           0         0          0        0           0          0               0         0                0
 9     9  2020           0         0          0        0           0          0               0         0                0
10    10  2020           0         0          0        0           0          0               0         0                0
# … with 14 more rows
  • Related