Home > Software engineering >  Select those groups which are available for certain years
Select those groups which are available for certain years

Time:03-20

I have a data.table as following:-

datazzz <- data.table(group = c(rep("a", times = 3),
                                rep("b", times = 4),
                                rep("c", times = 4),
                                rep("k", times = 2),
                                rep("f", times = 4)),
                      year = c(2017:2019, 2016:2019, 2016:2019, 2018, 2019,
                               2017:2020),
                      values = runif(17))

datazzz

    group year     values
 1:     a 2017 0.14710475
 2:     a 2018 0.23493958
 3:     a 2019 0.97570157
 4:     b 2016 0.82078366
 5:     b 2017 0.92685531
 6:     b 2018 0.64406726
 7:     b 2019 0.17611851
 8:     c 2016 0.96894329
 9:     c 2017 0.97501190
10:     c 2018 0.49732578
11:     c 2019 0.90125133
12:     k 2018 0.14836372
13:     k 2019 0.01368339
14:     f 2017 0.84735620
15:     f 2018 0.71688780
16:     f 2019 0.62894310
17:     f 2020 0.73526859

I want to select only those groups who have the years from 2016 till 2019. And hence, my resulting data.table would look like

    group year     values
 1:     b 2016 0.82078366
 2:     b 2017 0.92685531
 3:     b 2018 0.64406726
 4:     b 2019 0.17611851
 5:     c 2016 0.96894329
 6:     c 2017 0.97501190
 7:     c 2018 0.49732578
 8:     c 2019 0.90125133

CodePudding user response:

The subsetting condition is that all years are present in the group. We can construct a variable V1 with that condition and select rows based on that by passing row indices through .I.

datazzz[datazzz[, .I[all(2016:2019 %in% unique(year))], by = .(group)]$V1]

   group year     values
1:     b 2016 0.86527048
2:     b 2017 0.46478348
3:     b 2018 0.94761731
4:     b 2019 0.05005278
5:     c 2016 0.73977484
6:     c 2017 0.23698556
7:     c 2018 0.29560906
8:     c 2019 0.61450736

CodePudding user response:

We could do:

library(data.table)
setDT(datazzz)[, if(min(year) == 2016 & max(year)==2019) .SD, by = group]


   group year    values
1:     b 2016 0.2321175
2:     b 2017 0.2776979
3:     b 2018 0.5695105
4:     b 2019 0.7224908
5:     c 2016 0.1904413
6:     c 2017 0.4608467
7:     c 2018 0.8258316
8:     c 2019 0.7198854
  • Related