Trying to figure out how to code something simple.
I have a dataset that has observations for individuals (small invertebrates) in my experiment over time, including the week, individual's id #, and the observation data of interest (parasite counts). I also have a cumulative total over time for the parasite counts, grouped by the individual's ID, which is what I will actually want per week.
I would like to drop individuals that, by the end of the experiment, never had an observed sample that was positive for parasites, because they were not successfully infected. My plan was to have a binary indicator column that told me if an individual didn't have a positive sample by the end of the experiment, based on the final cumulative total per individual id (it's possible that an individual could give a positive sample one week but not the next, so a 0 cumulative total is more safe). Then I would simply subset the data by the positive binary column, removing individuals who were never positive.
A very simplified version of my dataframe would look something like:
time = c(rep(1,4),rep(2,4),rep(3,4),rep(4,4))
ids = rep(c(101:104),4)
observations = c(rep(c(25,25,0,0),4))
df = data.frame(cbind(time,ids,observations))
df2 = df %>%
group_by(ids) %>%
mutate(cumtot = cumsum(observations))
df2
time ids observations cumtot
<dbl> <dbl> <dbl> <dbl>
1 1 101 25 25
2 1 102 25 25
3 1 103 0 0
4 1 104 0 0
5 2 101 25 50
6 2 102 25 50
7 2 103 0 0
8 2 104 0 0
9 3 101 25 75
10 3 102 25 75
11 3 103 0 0
12 3 104 0 0
13 4 101 25 100
14 4 102 25 100
15 4 103 0 0
16 4 104 0 0
(I will eventually aggregate these data into means/SEMs by week and treatment group.)
What I have tried so far creates a binary "infected" column, but identifies individuals that had a cumulative sum of 0 in week 14 only. What I want is for the code to then apply this binary outcome to all the individual ids from every week (so that I drop that individual from each week's aggregate data). Not sure how to do that...
# Make a column that indicates if a snail has not shed by experiment end
df_dropped = df2 %>%
group_by(ids) %>%
mutate(infected = ifelse(time==max(time)&cumtot==0, 0,1))
df_dropped
time ids observations cumtot infected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 101 25 25 1
2 1 102 25 25 1
3 1 103 0 0 1
4 1 104 0 0 1
5 2 101 25 50 1
6 2 102 25 50 1
7 2 103 0 0 1
8 2 104 0 0 1
9 3 101 25 75 1
10 3 102 25 75 1
11 3 103 0 0 1
12 3 104 0 0 1
13 4 101 25 100 1
14 4 102 25 100 1
15 4 103 0 0 0
16 4 104 0 0 0
I want the output to be:
time ids observations cumtot infected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 101 25 25 1
2 1 102 25 25 1
3 1 103 0 0 0
4 1 104 0 0 0
5 2 101 25 50 1
6 2 102 25 50 1
7 2 103 0 0 0
8 2 104 0 0 0
9 3 101 25 75 1
10 3 102 25 75 1
11 3 103 0 0 0
12 3 104 0 0 0
13 4 101 25 100 1
14 4 102 25 100 1
15 4 103 0 0 0
16 4 104 0 0 0
Thanks.
CodePudding user response:
You can just use any()
:
library(tidyverse)
df_dropped <- df2 %>%
group_by(ids) %>%
mutate(infected = as.numeric(any(observations > 0)))
df_dropped
#> # A tibble: 16 x 5
#> # Groups: ids [4]
#> time ids observations cumtot infected
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 101 25 25 1
#> 2 1 102 25 25 1
#> 3 1 103 0 0 0
#> 4 1 104 0 0 0
#> 5 2 101 25 50 1
#> 6 2 102 25 50 1
#> 7 2 103 0 0 0
#> 8 2 104 0 0 0
#> 9 3 101 25 75 1
#> 10 3 102 25 75 1
#> 11 3 103 0 0 0
#> 12 3 104 0 0 0
#> 13 4 101 25 100 1
#> 14 4 102 25 100 1
#> 15 4 103 0 0 0
#> 16 4 104 0 0 0
Created on 2022-02-28 by the reprex package (v2.0.1)