I am trying to get frequency of the colors by month. Eventually I want to make a line graph with a line for each color, showing what percentage of all colors it was for each month. My data looks like this:
ID color color2 color3 date
55 red blue NA 2020-03-15
67 yellow NA NA 2020-05-02
83 blue yellow NA 2020-05-17
78 red yellow blue 2020-12-15
43 green NA NA 2021-01-27
29 yellow green NA 2021-01-03
I need something like this so I can make the graph:
Month freq_blue freq_red freq_yellow freq_green %_blue %_red %_yellow %_green
03-2020 1 1 0 0 50 50 0 0
04-2020 0 0 0 0 0 0 0 0
05-2020 1 0 2 0 33.3 0 66.7 0
06-2020 0 0 0 0 0 0 0 0
07-2020 0 0 0 0 0 0 0 0
08-2020 0 0 0 0 0 0 0 0
09-2020 0 0 0 0 0 0 0 0
10-2020 0 0 0 0 0 0 0 0
11-2020 0 0 0 0 0 0 0 0
12-2020 1 1 1 0 33.3 33.3 33.3 0
01-2021 0 0 1 2 0 0 33.3 66.7
I'm trying to group by month and then calculate the frequencies but I'm having some trouble because there are multiple color
columns.
CodePudding user response:
Here is a solution using tidyverse
(and lubridate
).
First, you can put your colors into long format with pivot_longer
. You can remove rows that have missing colors/values.
Then you can calculate frequencies with count
after grouping by the month, year, and color (value). By grouping by month and year, you can calculate proportion (percentage).
Finally, you can put your data into wide form if desired with pivot_wider
. The function complete
is also used to fill in missing months and colors.
library(tidyverse)
library(lubridate)
df$date <- as.Date(df$date)
df %>%
mutate(month = month(date), year = year(date)) %>%
pivot_longer(cols = starts_with("color")) %>%
filter(!is.na(value)) %>%
group_by(month, year, value) %>%
count() %>%
group_by(month, year) %>%
mutate(percent = n/sum(n)) %>%
ungroup() %>%
complete(year, month = 1:12, value = c("blue", "red", "yellow", "green"), fill = list(n = 0, percent = 0)) %>%
pivot_wider(id_cols = c(month, year), names_from = value, values_from = c(n, percent))
Output
# A tibble: 24 × 10
month year n_blue n_green n_red n_yellow percent_blue percent_green percent_red percent_yellow
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2020 0 0 0 0 0 0 0 0
2 2 2020 0 0 0 0 0 0 0 0
3 3 2020 1 0 1 0 0.5 0 0.5 0
4 4 2020 0 0 0 0 0 0 0 0
5 5 2020 1 0 0 2 0.333 0 0 0.667
6 6 2020 0 0 0 0 0 0 0 0
7 7 2020 0 0 0 0 0 0 0 0
8 8 2020 0 0 0 0 0 0 0 0
9 9 2020 0 0 0 0 0 0 0 0
10 10 2020 0 0 0 0 0 0 0 0
# … with 14 more rows
CodePudding user response:
df <- tibble(ID = c(55,67,83,78,43,29),
color = c("red", "yellow", "blue", "red", "green", "yellow"),
color2 = c("red", NA, "yellow", "yellow", NA, "green"),
color3 = c(NA, NA, NA, "blue", NA, NA),
date = c("2020-03-15", "2020-05-02", "2020-05-17", "2020-12-15","2021-01-27", "2021-01-03"))
dat <- df %>% mutate(month = as.yearmon(date)) %>% group_by(month)%>%
pivot_longer(cols = c(color, color2, color3)) %>% mutate(val = 1) %>% drop_na() %>%
select(month, value, val) %>%
pivot_wider(id_cols= month, names_from=value, names_prefix = "freq_",values_from = val, values_fn = sum) %>%
unnest(cols = c(freq_red, freq_blue, freq_yellow, freq_green)) %>%
replace_na(list(freq_red =0, freq_blue = 0, freq_yellow = 0, freq_green = 0)) %>%
ungroup() %>% mutate_if(is.numeric, as.double) %>%mutate(`red_%` = round(freq_red/rowSums(.[2:5])*100,2),
`yellow_%` = round(freq_yellow/rowSums(.[2:5])*100,2),
`blue_%` = round(freq_blue/rowSums(.[2:5])*100,2),
`green_%` = round(freq_green/rowSums(.[2:5])*100,2))