If a dataframe is split by two groups, several categories and dates, how to get a list of dataframes-CodePudding

I have the next dataframe:

House = rep(c('A','B','C'), each=10)
Date = as.Date(c('2014-10-13','2014-10-20','2014-10-27','2014-11-03','2014-11-10','2014-10-30','2014-11-06','2014-11-13','2014-11-20','2014-11-27',
                 '2019-11-27','2019-12-04','2019-12-11','2019-12-18','2019-12-25','2020-01-01','2020-01-08','2020-01-15','2020-01-22','2020-01-29',
                 '2017-07-13','2017-07-20','2017-04-21','2017-04-28','2017-05-05','2017-05-12','2017-05-19','2017-05-26','2017-06-02','2017-06-09'))
Week = rep(c(1,2,3,4,5,6,7,8,9,10), times=3)
Value = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009,0.01)

mock = data.frame(House,Date,Week,Value)

I want to split this dataframe by Year in specific, BUT, I want to keep together the group B. Otherwise it gets split as follows:

$`2019`
    A data.frame: 5 × 4     House   Date    Week    Value
        <chr>   <date>  <dbl>   <dbl>
    11  B   2019-11-27  1   0.01
    12  B   2019-12-04  2   0.02
    13  B   2019-12-11  3   0.03
    14  B   2019-12-18  4   0.04
    15  B   2019-12-25  5   0.05
$`2020`
    A data.frame: 5 × 4     House   Date    Week    Value
        <chr>   <date>  <dbl>   <dbl>
    16  B   2020-01-01  6   0.06
    17  B   2020-01-08  7   0.07
    18  B   2020-01-15  8   0.08
    19  B   2020-01-22  9   0.09
    20  B   2020-01-29  10  0.10

Desired output:

$`2019-2020`
    A data.frame: 5 × 4     House   Date    Week    Value
        <chr>   <date>  <dbl>   <dbl>
    11  B   2019-11-27  1   0.01
    12  B   2019-12-04  2   0.02
    13  B   2019-12-11  3   0.03
    14  B   2019-12-18  4   0.04
    15  B   2019-12-25  5   0.05
    16  B   2020-01-01  6   0.06
    17  B   2020-01-08  7   0.07
    18  B   2020-01-15  8   0.08
    19  B   2020-01-22  9   0.09
    20  B   2020-01-29  10  0.10

I tried a similar approach using dplyr::group_split(), but obtaining a similar output. On the other hand, the cleanest list with split() is given by the next, but I DON'T need it by House:

house.year = split(mock, c(House, format(mock$Date, "%Y")))
house.year = house.year[sapply(house.year, function(x) dim(x)[1]) > 0]

NOTE: The name of the desired list is how I imagine it, not how I expect it, but I would like that content included in this list of lists.

EDIT: edited the title to point that I have more than 3 categories, like asked in the beginning.

CodePudding user response：

house.year <- split(mock, list(mock$House, format(mock$Date, "%Y")), drop = TRUE)
lst1 <- lapply(split(house.year, sub("\\.\\d $", "", 
    names(house.year))),
     \(x) {tmp <- do.call(rbind, x)
         row.names(tmp) <- NULL; tmp})
names(lst1) <- sapply(lst1, \(x) paste(unique(format(x$Date, "%Y")),
     collapse = "-"))

-output

> lst1
$`2014`
   House       Date Week Value
1      A 2014-10-13    1   0.1
2      A 2014-10-20    2   0.2
3      A 2014-10-27    3   0.3
4      A 2014-11-03    4   0.4
5      A 2014-11-10    5   0.5
6      A 2014-10-30    6   0.6
7      A 2014-11-06    7   0.7
8      A 2014-11-13    8   0.8
9      A 2014-11-20    9   0.9
10     A 2014-11-27   10   1.0

$`2019-2020`
   House       Date Week Value
1      B 2019-11-27    1  0.01
2      B 2019-12-04    2  0.02
3      B 2019-12-11    3  0.03
4      B 2019-12-18    4  0.04
5      B 2019-12-25    5  0.05
6      B 2020-01-01    6  0.06
7      B 2020-01-08    7  0.07
8      B 2020-01-15    8  0.08
9      B 2020-01-22    9  0.09
10     B 2020-01-29   10  0.10

$`2017`
   House       Date Week Value
1      C 2017-07-13    1 0.001
2      C 2017-07-20    2 0.002
3      C 2017-04-21    3 0.003
4      C 2017-04-28    4 0.004
5      C 2017-05-05    5 0.005
6      C 2017-05-12    6 0.006
7      C 2017-05-19    7 0.007
8      C 2017-05-26    8 0.008
9      C 2017-06-02    9 0.009
10     C 2017-06-09   10 0.010

CodePudding user response：

Probably you can try code below

mock %>%
  group_by(House) %>%
  mutate(year = paste0(unique(format(Date, "%Y")), collapse = "-")) %>%
  ungroup() %>%
  split(~year) %>%
  lapply(`[`, -(length(mock) 1))

which gives

$`2014`
# A tibble: 10 × 4
   House Date        Week Value
   <chr> <date>     <dbl> <dbl>
 1 A     2014-10-13     1   0.1
 2 A     2014-10-20     2   0.2
 3 A     2014-10-27     3   0.3
 4 A     2014-11-03     4   0.4
 5 A     2014-11-10     5   0.5
 6 A     2014-10-30     6   0.6
 7 A     2014-11-06     7   0.7
 8 A     2014-11-13     8   0.8
 9 A     2014-11-20     9   0.9
10 A     2014-11-27    10   1

$`2017`
# A tibble: 10 × 4
   House Date        Week Value
   <chr> <date>     <dbl> <dbl>
 1 C     2017-07-13     1 0.001
 2 C     2017-07-20     2 0.002
 3 C     2017-04-21     3 0.003
 4 C     2017-04-28     4 0.004
 5 C     2017-05-05     5 0.005
 6 C     2017-05-12     6 0.006
 7 C     2017-05-19     7 0.007
 8 C     2017-05-26     8 0.008
 9 C     2017-06-02     9 0.009
10 C     2017-06-09    10 0.01

$`2019-2020`
# A tibble: 10 × 4
   House Date        Week Value
   <chr> <date>     <dbl> <dbl>
 1 B     2019-11-27     1  0.01
 2 B     2019-12-04     2  0.02
 3 B     2019-12-11     3  0.03
 4 B     2019-12-18     4  0.04
 5 B     2019-12-25     5  0.05
 6 B     2020-01-01     6  0.06
 7 B     2020-01-08     7  0.07
 8 B     2020-01-15     8  0.08
 9 B     2020-01-22     9  0.09
10 B     2020-01-29    10  0.1

CodePudding user response：

My real dataset is a bit more complex with more Houses in it. Then after akrun solution I would get a list with repeated Years as name, like here for 2017:

$`2014`
   House       Date Week Value
1      A 2014-10-13    1   0.1
2      A 2014-10-20    2   0.2
...

$`2019-2020`
   House       Date Week Value
1      B 2019-11-27    1  0.01
2      B 2019-12-04    2  0.02
...

$`2017`
   House       Date Week Value
1      C 2017-07-13    1 0.001
2      C 2017-07-20    2 0.002
...

$`2017`
   House       Date Week Value
1      D 2017-07-13    1 0.001
2      D 2017-07-20    2 0.002
...

akrun's solution can be improved by adding another lapply, like in this solution by Mohamed, to rbind all the rows under the same Year, which is what I wanted in the end for subsequent analyses:

lst1 = lapply(split(lst1, names(lst1)) , \(x) do.call(rbind , x))

output:

$`2014`
   House       Date Week Value
1      A 2014-10-13    1   0.1
2      A 2014-10-20    2   0.2
...

$`2019-2020`
   House       Date Week Value
1      B 2019-11-27    1  0.01
2      B 2019-12-04    2  0.02
...

$`2017`
   House       Date Week Value
1      C 2017-07-13    1 0.001
2      C 2017-07-20    2 0.002
1      D 2017-07-13    1 0.001
2      D 2017-07-20    2 0.002
...