I have a dataframe of time series data which is tracking the status of several species as discrete states. I want to calculate what proportion of each species is in each state for each day in the time series.
So my data is something like this:
example.df <- data.frame(date = rep(seq(ymd("2022-3-6"), ymd("2022-3-8"), by = "days"),
each = 8),
species = rep(c("a", "b"), each = 4),
state = sample(0:2, replace = TRUE, size = 24))
print(example.df)
date species state
1 2022-03-06 a 0
2 2022-03-06 a 2
3 2022-03-06 a 1
4 2022-03-06 a 0
5 2022-03-06 b 2
6 2022-03-06 b 1
7 2022-03-06 b 0
8 2022-03-06 b 1
9 2022-03-07 a 0
10 2022-03-07 a 1
11 2022-03-07 a 0
12 2022-03-07 a 0
13 2022-03-07 b 1
14 2022-03-07 b 1
15 2022-03-07 b 1
16 2022-03-07 b 2
17 2022-03-08 a 1
18 2022-03-08 a 0
19 2022-03-08 a 1
20 2022-03-08 a 1
21 2022-03-08 b 1
22 2022-03-08 b 2
23 2022-03-08 b 2
24 2022-03-08 b 2
I want to produce something like this:
date species proportion.0 proportion.1 proportion.2
1 2022-03-06 a 0.25 0.25 0.50
2 2022-03-06 b 0.25 0.50 0.25
3 2022-03-07 a 0.25 0.50 0.25
4 2022-03-07 b 0.25 0.50 0.25
5 2022-03-08 a 0.00 1.00 0.00
6 2022-03-08 b 0.00 0.75 0.25
I've been trying to do this in dplyr using group_by() and summarise() but I cannot figure out a generic term to get the proportion for every day/ species combination. Any help would be greatly appreciated, staying inside the tidyverse if possible.
CodePudding user response:
You Could do something like this:
library(tidyverse)
library(lubridate)
example.df |>
count(date, species, state) |>
group_by(date, species) |>
mutate(f = n/sum(n),
state = paste0("proportion.", state)) |>
select(-n) |>
ungroup()|>
pivot_wider(names_from = state, values_from = f, values_fill = 0)
#> # A tibble: 6 x 5
#> date species proportion.0 proportion.1 proportion.2
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2022-03-06 a 0.25 0.5 0.25
#> 2 2022-03-06 b 0.25 0.5 0.25
#> 3 2022-03-07 a 0.75 0 0.25
#> 4 2022-03-07 b 0.5 0.25 0.25
#> 5 2022-03-08 a 0.5 0 0.5
#> 6 2022-03-08 b 0.5 0 0.5
First we count
the number of species in each state in each day. Then we calculate the proportion of each state (i.e., f
) for each date and species. Lastly, we pivot_wider
to go from long to wide format.
EDIT:
I thought of another option:
example.df |>
mutate(v = 1) |>
pivot_wider(names_prefix = "proportion.",
names_from = state,
values_from = v,
values_fn = list(v = sum),
values_fill = 0) |>
(\(d) mutate(d, across(contains("proportion."), \(x) x/rowSums(d[,3:5]))))()
#> # A tibble: 6 x 5
#> date species proportion.1 proportion.0 proportion.2
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2022-03-06 a 0.75 0.25 0
#> 2 2022-03-06 b 0.25 0.25 0.5
#> 3 2022-03-07 a 0.25 0.25 0.5
#> 4 2022-03-07 b 0.5 0.5 0
#> 5 2022-03-08 a 0.25 0.5 0.25
#> 6 2022-03-08 b 0 0 1
Here we set up a placeholder column called v
, then we pivot wider and calculate the number of states per species per date. Then we calculate the proportions using rowSums
.