Removing people from a dataframe in R if their results don't span a specific period of time-CodePudding

Dataset of blood results:

        id result date    
        A1 80     01/01/2006
        A1 70     02/10/2006
        A1 61     01/01/2007
        A1 30     01/01/2008
        A1 28     03/06/2008
        B2 40     01/01/2006
        B2 30     01/10/2006
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019
        U7 20     01/01/2005
        U7 19     01/01/2006
        U7 18     01/04/2006
        U7 18     01/08/2006

I would like to only keep those individuals who have blood results spanning at least a three year period.

I can convert their dates to just years along the line of:

df %>% 
  # Create a column holding year for each ID
  mutate(date = dmy(date)) %>% 
  mutate(year = year(date)) %>% 
  # group by ID
  group_by(ID, year) %>% 
  # find max diff
  summarise(max_diff = max(year) - min(year))

How would I then continue the pipe to remove those with a max diff <3. The desired output from the above example would be:

    id result date    
    B2 40     2006
    B2 30     2006
    B2 25     2015
    B2 10     2020
    G3 28     2009
    G3 27     2014
    G3 25     2013
    G3 24     2011
    G3 22     2019

I would then pipe these people into the answer for this question: Predicting when an output might happen in time in R

Many thanks

CodePudding user response：

Does this work:

library(dplyr)
library(lubridate)

df %>% mutate(year = year(dmy(date))) %>% group_by(id) %>% 
                                     filter(length(unique(year)) > 2)
# A tibble: 14 × 4
# Groups:   id [3]
   id    result date        year
   <chr>  <dbl> <chr>      <dbl>
 1 A1        80 01/01/2006  2006
 2 A1        70 02/10/2006  2006
 3 A1        61 01/01/2007  2007
 4 A1        30 01/01/2008  2008
 5 A1        28 03/06/2008  2008
 6 B2        40 01/01/2006  2006
 7 B2        30 01/10/2006  2006
 8 B2        25 01/01/2015  2015
 9 B2        10 01/01/2020  2020
10 G3        28 01/01/2009  2009
11 G3        27 01/01/2014  2014
12 G3        25 01/01/2013  2013
13 G3        24 01/01/2011  2011
14 G3        22 01/01/2019  2019

CodePudding user response：

Cutting down to year only is not a great move because you'll under/overestimate the range depending on whether people had their first/last blood test early or late in the year (case in point, A1 has blood tests in 3 separate calendar years, but less than 3 years apart overall).

To actually work out the range between first and last blood test, you can use diff(range()). In this case, I'm using tapply.

df <- structure(list(id = c("A1", "A1", "A1", "A1", "A1", "B2", "B2", 
"B2", "B2", "G3", "G3", "G3", "G3", "G3", "U7", "U7", "U7", "U7"
), result = c(80L, 70L, 61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L, 
27L, 25L, 24L, 22L, 20L, 19L, 18L, 18L), date = structure(c(13149, 
13423, 13514, 13879, 14033, 13149, 13422, 16436, 18262, 14245, 
16071, 15706, 14975, 17897, 12784, 13149, 13239, 13361), class = "Date")), 
row.names = c(NA, -18L), class = "data.frame")

ranges <- tapply(df$date, df$id, function(dates) diff(range(dates))/365.25)
ranges

       A1        B2        G3        U7 
 2.420260 13.998631  9.998631  1.579740 

keep <- names(ranges)[ranges>=3]
keep
[1] "B2" "G3"

You can flip this to use pipes if that is your preferred syntax.