Dataset of blood results:
id result date
A1 80 01/01/2006
A1 70 02/10/2006
A1 61 01/01/2007
A1 30 01/01/2008
A1 28 03/06/2008
B2 40 01/01/2006
B2 30 01/10/2006
B2 25 01/01/2015
B2 10 01/01/2020
G3 28 01/01/2009
G3 27 01/01/2014
G3 25 01/01/2013
G3 24 01/01/2011
G3 22 01/01/2019
U7 20 01/01/2005
U7 19 01/01/2006
U7 18 01/04/2006
U7 18 01/08/2006
I would like to only keep those individuals who have blood results spanning at least a three year period.
I can convert their dates to just years along the line of:
df %>%
# Create a column holding year for each ID
mutate(date = dmy(date)) %>%
mutate(year = year(date)) %>%
# group by ID
group_by(ID, year) %>%
# find max diff
summarise(max_diff = max(year) - min(year))
How would I then continue the pipe to remove those with a max diff <3. The desired output from the above example would be:
id result date
B2 40 2006
B2 30 2006
B2 25 2015
B2 10 2020
G3 28 2009
G3 27 2014
G3 25 2013
G3 24 2011
G3 22 2019
I would then pipe these people into the answer for this question: Predicting when an output might happen in time in R
Many thanks
CodePudding user response:
Does this work:
library(dplyr)
library(lubridate)
df %>% mutate(year = year(dmy(date))) %>% group_by(id) %>%
filter(length(unique(year)) > 2)
# A tibble: 14 × 4
# Groups: id [3]
id result date year
<chr> <dbl> <chr> <dbl>
1 A1 80 01/01/2006 2006
2 A1 70 02/10/2006 2006
3 A1 61 01/01/2007 2007
4 A1 30 01/01/2008 2008
5 A1 28 03/06/2008 2008
6 B2 40 01/01/2006 2006
7 B2 30 01/10/2006 2006
8 B2 25 01/01/2015 2015
9 B2 10 01/01/2020 2020
10 G3 28 01/01/2009 2009
11 G3 27 01/01/2014 2014
12 G3 25 01/01/2013 2013
13 G3 24 01/01/2011 2011
14 G3 22 01/01/2019 2019
CodePudding user response:
Cutting down to year only is not a great move because you'll under/overestimate the range depending on whether people had their first/last blood test early or late in the year (case in point, A1 has blood tests in 3 separate calendar years, but less than 3 years apart overall).
To actually work out the range between first and last blood test, you can use diff(range())
. In this case, I'm using tapply
.
df <- structure(list(id = c("A1", "A1", "A1", "A1", "A1", "B2", "B2",
"B2", "B2", "G3", "G3", "G3", "G3", "G3", "U7", "U7", "U7", "U7"
), result = c(80L, 70L, 61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L,
27L, 25L, 24L, 22L, 20L, 19L, 18L, 18L), date = structure(c(13149,
13423, 13514, 13879, 14033, 13149, 13422, 16436, 18262, 14245,
16071, 15706, 14975, 17897, 12784, 13149, 13239, 13361), class = "Date")),
row.names = c(NA, -18L), class = "data.frame")
ranges <- tapply(df$date, df$id, function(dates) diff(range(dates))/365.25)
ranges
A1 B2 G3 U7
2.420260 13.998631 9.998631 1.579740
keep <- names(ranges)[ranges>=3]
keep
[1] "B2" "G3"
You can flip this to use pipes if that is your preferred syntax.