Home > Back-end >  Counting previous occurrences of certain variable per group and storing as new column
Counting previous occurrences of certain variable per group and storing as new column

Time:03-23

I want to create five new columns that count how often a certain value of "stars" has happened for this business before this particular row (i.e., summing up over all rows with a smaller rolingcount but holding the business constant).

For the first row of each business (i.e., where rolingcount == 0), it should be NA, because there have been no previous occurrences for this business.

Here is an exemplary dataset:

business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc") 
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8) 
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5) 
df <- cbind(business, rolingcount, stars)

I feel my problem is related to this, but with a gist, that I don't get to work: Numbering rows within groups in a data frame

I also unsuccessfully experimented with while loops.

Ideally, something like this will be the output. (I leave out previousthree, previoustwo, previousone, because I believe they will work identical).

business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)`

Since, I will have to do this for more than 10 M rows, a fast option would be cool. Your help is much appreciated! :)

CodePudding user response:

I don't know if the option is really fast, I'm not used to deal with that many rows. Here is a solution using dplyr package in the tidyverse :

library(tidyverse)
df %>% 
  as.data.frame() %>% 
  group_by(business) %>% 
  mutate(stars = as.numeric(stars),
         lag_stars = lag(stars, 1, default = 0),
         previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>% 
  ungroup() %>% 
  select(-lag_stars)

Output :

# A tibble: 16 x 8
   business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
   <chr>    <chr>       <dbl>         <dbl>         <dbl>          <dbl>        <dbl>        <dbl>
 1 aaa      1               5            NA            NA             NA           NA           NA
 2 aaa      2               5             1             0              0            0            0
 3 aaa      3               3             2             0              0            0            0
 4 bbb      1               5            NA            NA             NA           NA           NA
 5 bbb      2               5             1             0              0            0            0
 6 bbb      3               1             2             0              0            0            0
 7 bbb      4               2             2             0              0            0            1
 8 bbb      5               3             2             0              0            1            1
 9 ccc      1               5            NA            NA             NA           NA           NA
10 ccc      2               1             1             0              0            0            0
11 ccc      3               2             1             0              0            0            1
12 ccc      4               3             1             0              0            1            1
13 ccc      5               4             1             0              1            1            1
14 ccc      6               5             1             1              1            1            1
15 ccc      7               5             2             1              1            1            1
16 ccc      8               5             3             1              1            1            1

Basically, group_by is to perform the operation for each business, and it makes a cumulative lagged sum. Maybe it'll lead you to another faster idea if it is too slow. Hope it helped.

  • Related