I want to create five new columns that count how often a certain value of "stars" has happened for this business before this particular row (i.e., summing up over all rows with a smaller rolingcount but holding the business constant).
For the first row of each business (i.e., where rolingcount == 0), it should be NA, because there have been no previous occurrences for this business.
Here is an exemplary dataset:
business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5)
df <- cbind(business, rolingcount, stars)
I feel my problem is related to this, but with a gist, that I don't get to work: Numbering rows within groups in a data frame
I also unsuccessfully experimented with while loops.
Ideally, something like this will be the output. (I leave out previousthree, previoustwo, previousone, because I believe they will work identical).
business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)`
Since, I will have to do this for more than 10 M rows, a fast option would be cool. Your help is much appreciated! :)
CodePudding user response:
I don't know if the option is really fast, I'm not used to deal with that many rows. Here is a solution using dplyr package in the tidyverse :
library(tidyverse)
df %>%
as.data.frame() %>%
group_by(business) %>%
mutate(stars = as.numeric(stars),
lag_stars = lag(stars, 1, default = 0),
previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>%
ungroup() %>%
select(-lag_stars)
Output :
# A tibble: 16 x 8
business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aaa 1 5 NA NA NA NA NA
2 aaa 2 5 1 0 0 0 0
3 aaa 3 3 2 0 0 0 0
4 bbb 1 5 NA NA NA NA NA
5 bbb 2 5 1 0 0 0 0
6 bbb 3 1 2 0 0 0 0
7 bbb 4 2 2 0 0 0 1
8 bbb 5 3 2 0 0 1 1
9 ccc 1 5 NA NA NA NA NA
10 ccc 2 1 1 0 0 0 0
11 ccc 3 2 1 0 0 0 1
12 ccc 4 3 1 0 0 1 1
13 ccc 5 4 1 0 1 1 1
14 ccc 6 5 1 1 1 1 1
15 ccc 7 5 2 1 1 1 1
16 ccc 8 5 3 1 1 1 1
Basically, group_by is to perform the operation for each business, and it makes a cumulative lagged sum. Maybe it'll lead you to another faster idea if it is too slow. Hope it helped.