Home > Software engineering >  Group R dataframe by counter variable and apply filtering/sorting
Group R dataframe by counter variable and apply filtering/sorting

Time:11-16

My dataframe:

amount counter
120 1
778 1
4 2
322 3
51 4
6 5
17 6
1 6
489 7
224 8
33 9
56 9
... ...

What I want to do:

The counter variable is counting in which week my data is occuring. I want to group the dataframe by 4 weeks, meaning I have to group by the counter variable e.g., 1 to 4, 5 to 8 and so on. Then I want to sort the groups ascending by amount and only keep the rows with the 30 highest amounts per group.

What functions I might need:

group_by (dplyr package) and filter (dplyr package)

What I tried:

I found out that I will need the "group_by" and "filter" functions from the dplyr package. I read thru the documentation and took a look at code examples but am not able to refer it to my usecase.

CodePudding user response:

Four-weekly groupings

If you want to do this four-weekly using dplyr, you can do this:

NUM_WEEKS  <- 4
TOP_N  <- 30

dat  |>
    mutate(
        four_week_group = cut(
            counter, breaks = seq(
                from = 0,
                to = max(counter)   NUM_WEEKS,
                by = NUM_WEEKS,
            )
        ),
        group_num  = as.integer(four_week_group)
    )  |>
    group_by(four_week_group)  |>
    arrange(four_week_group, desc(amount))  |>
    filter(row_number() <= TOP_N)  

# # A tibble: 12 x 4
# # Groups:   four_week_group [3]
#    amount counter four_week_group group_num
#     <int>   <int> <fct>               <int>
#  1    778       1 (0,4]                   1
#  2    322       3 (0,4]                   1
#  3    120       1 (0,4]                   1
#  4     51       4 (0,4]                   1
#  5      4       2 (0,4]                   1
#  6    489       7 (4,8]                   2
#  7    224       8 (4,8]                   2
#  8     17       6 (4,8]                   2
#  9      6       5 (4,8]                   2
# 10      1       6 (4,8]                   2
# 11     56       9 (8,12]                  3
# 12     33       9 (8,12]                  3

The output is the same as the input in this case (apart from the row group) as all groups have <30 rows.

Monthly groupings

You could do the same with monthly groupings as well if you provide a year:

YEAR  <- 2022
dat  |>
    mutate(
        date_to_use = as.Date(
                paste(YEAR, counter, 1, sep="-"), 
            "%Y-%U-%u"), 
        month = format(date_to_use, "%B")
    )  
#    amount counter date_to_use    month
# 1     120       1  2022-01-03  January
# 2     778       1  2022-01-03  January
# 3       4       2  2022-01-10  January
# 4     322       3  2022-01-17  January
# 5      51       4  2022-01-24  January
# 6       6       5  2022-01-31  January
# 7      17       6  2022-02-07 February
# 8       1       6  2022-02-07 February
# <etc>

The rest of the steps would be the same (using group_by(month) of course).

Data

dat  <- structure(list(amount = c(120L, 778L, 4L, 322L, 51L, 6L, 17L,
1L, 489L, 224L, 33L, 56L), counter = c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 6L, 7L, 8L, 9L, 9L)), class = "data.frame", row.names = c(NA,
-12L))

# # A tibble: 12 x 3
# # Groups:   four_week_group [3]
#    amount counter four_week_group
#     <int>   <int> <fct>
#  1    778       1 (0,4]
#  2    489       7 (4,8]
#  3    322       3 (0,4]
#  4    224       8 (4,8]
#  5    120       1 (0,4]
#  6     56       9 (8,12]
#  7     51       4 (0,4]
#  8     33       9 (8,12]
#  9     17       6 (4,8]
# 10      6       5 (4,8]
# 11      4       2 (0,4]
# 12      1       6 (4,8]
  • Related