Home > Back-end >  Writing a function in R with 3 different grouping variables
Writing a function in R with 3 different grouping variables

Time:03-13

I am trying to group a table after three different things at once. I have problems with the last step. The first ten rows of my dataset (band_list) looks like the following:

   BandName         Year BandType
  1 Agamemnon        2020    0
  2 Ajax             2010    0
  3 Ajax             2011    0
  4 Ajax             2012    0
  5 Ajax             2013    0
  6 Ajax             2014    0
  7 Ajax-Egerton     2016    1
  8 Aldo             2016    0
  9 Aldo             2017    0
 10 Aldo-Knottboy    2018    1

First, I want to group my dataset by name and band type. For this, I know that I can use the following code.

df <- band_list %>% group_by(`BandName(Stallion-Tag)`, BandType)

Second, I want to group by year. Meaning, that I want to group the band with the same name and band type as well as consecutive years as one group and count the total amount of occurences. For example, Ajax (rows 2-6) occurs five times with the same name and type and also within consecutive years (2010-2014), so I want to get the result of 5.

occurence_band <- band_list %>% count(`BandName(Stallion-Tag)`, BandType)

I do get this with my code, but I have some data where a year is missing in between but the name and type stays the same. For example,

   BandName Year BandType
40 Arno     2014    0
41 Arno     2015    0
42 Arno     2017    0 
43 Arno     2018    0
44 Arno     2020    0

Using the same codes above R would give me the count of 5 for Arno but I want to have three different groups for Arno here: Arno (2014-2015) with count two, Arno (2017-2018) also with count two and Arno (2020) with count one. I need it seperated like this because it is important for further analysis I have to do. I thought that writing my own function might solve my problem but I only wrote my own function once, so I have now ideas how to tell R to look for consequtive years and count it (and I did not found anything that worked). Any help would be really appreciated.

CodePudding user response:

Create a Runs variable with a logical cumsum trick and include this variable in the grouping variables. Then count the subgroups.

df1 <- read.table(text = "
   BandName Year BandType
40 Arno     2014    0
41 Arno     2015    0
42 Arno     2017    0 
43 Arno     2018    0
44 Arno     2020    0
", header = TRUE)

suppressPackageStartupMessages(library(dplyr))

df1 %>%
  mutate(Runs = cumsum(c(1L, diff(Year)) > 1L)) %>%
  group_by(BandName, BandType, Runs) %>%
  summarize(Count = n(), .groups = "drop") %>%
  select(-Runs)
#> # A tibble: 3 x 3
#>   BandName BandType Count
#>   <chr>       <int> <int>
#> 1 Arno            0     2
#> 2 Arno            0     2
#> 3 Arno            0     1

Created on 2022-03-13 by the reprex package (v2.0.1)

This also works with df2 below.

df2 <- read.table(text = "
BandName         Year BandType
  1 Agamemnon        2020    0
  2 Ajax             2010    0
  3 Ajax             2011    0
  4 Ajax             2012    0
  5 Ajax             2013    0
  6 Ajax             2014    0
  7 Ajax-Egerton     2016    1
  8 Aldo             2016    0
  9 Aldo             2017    0
 10 Aldo-Knottboy    2018    1
", header = TRUE)

df2 %>%
  mutate(Runs = cumsum(c(1L, diff(Year)) > 1L)) %>%
  group_by(BandName, BandType, Runs) %>%
  summarize(Count = n(), .groups = "drop") %>%
  select(-Runs)
#> # A tibble: 5 x 3
#>   BandName      BandType Count
#>   <chr>            <int> <int>
#> 1 Agamemnon            0     1
#> 2 Ajax                 0     5
#> 3 Ajax-Egerton         1     1
#> 4 Aldo                 0     2
#> 5 Aldo-Knottboy        1     1

Created on 2022-03-13 by the reprex package (v2.0.1)

CodePudding user response:

Here using the data.table package:

library(data.table)
dat <- as.data.table(test)
result <- dat[, .SD[, .(count= .N), Year], .(BandName, BandType)]
  • Related