I am using dplyr
to create multiple variables in my data frame using mutate. At the same time, I am using zoo
to calculate a rolling average. As an example, I have my variables set up like so:
vars <- "total_apples", "total_oranges", "total_bananas"
My data has over 100 variables and approx. 40,000 lines, but the above is just an example.
I am using this code below:
library(dplyr)
library(zoo)
data <- data %>%
group_by(fruit) %>%
mutate(across(c(all_of(vars)), list(avge_last2 = ~ zoo::rollapplyr(., 2, FUN = mean, partial = TRUE))))
Just for the above to calculate the average over the last 2 records, it takes over 5 mins:
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.925337 mins
It takes even longer if I want to average over more records, say n= 10 like so:
library(dplyr)
library(zoo)
data <- data %>%
group_by(fruit) %>%
mutate(across(c(all_of(vars)), list(avge_last2 = ~ zoo::rollapplyr(., 10, FUN = mean, partial = TRUE))))
Is there an issue with my code or is it something else?
dput(head(data,20))
provides the following:
structure(list(match_id = c(14581L, 14581L, 14581L, 14581L, 14581L,
14581L, 14581L, 14581L, 14581L, 14581L, 14581L, 14581L, 14581L,
14581L, 14581L, 14581L, 14581L, 14581L, 14581L, 14581L), match_date = structure(c(16527,
16527, 16527, 16527, 16527, 16527, 16527, 16527, 16527, 16527,
16527, 16527, 16527, 16527, 16527, 16527, 16527, 16527, 16527,
16527), class = "Date"), season = c(2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015), match_round = c(1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), home_team = c(3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), away_team = c(14,
14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
14, 14, 14), venue = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11), venue_name = c("MCG",
"MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG",
"MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG", "MCG",
"MCG"), opponent = c(14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
14, 14, 14, 14, 14, 14, 14, 14, 14, 14), player_id = c(11186L,
11215L, 11285L, 11330L, 11380L, 11388L, 11407L, 11472L, 11473L,
11490L, 11553L, 11561L, 11573L, 11582L, 11598L, 11601L, 11616L,
11643L, 11671L, 11737L), player_first_name = c("Chris", "Chris",
"Kade", "Troy", "Andrew", "Brett", "Cameron", "Marc", "Dale",
"Ivan", "Bryce", "Shane", "Bachar", "Jack", "Andrejs", "Shaun",
"Michael", "Lachie", "Trent", "Alex"), player_last_name = c("Judd",
"Newman", "Simpson", "Chaplin", "Carrazzo", "Deledio", "Wood",
"Murphy", "Thomas", "Maric", "Gibbs", "Edwards", "Houli", "Riewoldt",
"Everitt", "Grigg", "Jamison", "Henderson", "Cotchin", "Rance"
), player_team = c("Carlton", "Richmond", "Carlton", "Richmond",
"Carlton", "Richmond", "Carlton", "Carlton", "Carlton", "Richmond",
"Carlton", "Richmond", "Richmond", "Richmond", "Carlton", "Richmond",
"Carlton", "Carlton", "Richmond", "Richmond"), player_team_numeric = c(3,
14, 3, 14, 3, 14, 3, 3, 3, 14, 3, 14, 14, 14, 3, 14, 3, 3, 14,
14), guernsey_number = c(5L, 1L, 6L, 25L, 44L, 3L, 36L, 3L, 39L,
20L, 4L, 10L, 14L, 8L, 33L, 6L, 40L, 23L, 9L, 18L), player_position = c(3,
14, 14, 1, 17, 13, 16, 12, 20, 16, 14, 5, 10, 8, 13, 14, 6, 7,
3, 2), disposals = c(21L, 7L, 21L, 13L, 18L, 18L, 11L, 21L, 1L,
13L, 26L, 21L, 21L, 17L, 18L, 17L, 8L, 10L, 17L, 18L), kicks = c(16L,
6L, 13L, 9L, 9L, 9L, 8L, 9L, 1L, 8L, 15L, 9L, 15L, 13L, 14L,
9L, 4L, 9L, 6L, 9L), marks = c(5L, 1L, 8L, 1L, 2L, 3L, 2L, 2L,
0L, 4L, 4L, 1L, 5L, 8L, 8L, 4L, 2L, 6L, 3L, 4L), handballs = c(5L,
1L, 8L, 4L, 9L, 9L, 3L, 12L, 0L, 5L, 11L, 12L, 6L, 4L, 4L, 8L,
4L, 1L, 11L, 9L), tackles = c(6L, 1L, 2L, 2L, 2L, 0L, 1L, 2L,
0L, 4L, 4L, 3L, 1L, 0L, 2L, 2L, 1L, 2L, 1L, 0L), clearances = c(6L,
0L, 0L, 0L, 6L, 1L, 6L, 4L, 0L, 4L, 4L, 7L, 0L, 0L, 1L, 3L, 0L,
0L, 1L, 1L), brownlow_votes = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), effective_disposals = c(15L,
6L, 16L, 11L, 16L, 13L, 6L, 14L, 1L, 11L, 13L, 16L, 16L, 10L,
14L, 12L, 5L, 6L, 9L, 17L), disposal_efficiency_percentage = c(71L,
86L, 76L, 85L, 89L, 72L, 55L, 67L, 100L, 85L, 50L, 76L, 76L,
59L, 78L, 71L, 63L, 60L, 53L, 94L), contested_possessions = c(11L,
3L, 5L, 7L, 9L, 6L, 7L, 9L, 1L, 9L, 9L, 15L, 1L, 7L, 3L, 4L,
3L, 4L, 5L, 5L), uncontested_possessions = c(10L, 4L, 17L, 6L,
10L, 12L, 4L, 12L, 0L, 4L, 17L, 7L, 18L, 9L, 14L, 11L, 5L, 7L,
12L, 14L), time_on_ground_percentage = c(79L, 65L, 73L, 100L,
76L, 69L, 89L, 81L, 1L, 88L, 73L, 83L, 85L, 98L, 95L, 81L, 96L,
91L, 86L, 96L), afl_fantasy_score = c(93L, 26L, 97L, 42L, 54L,
53L, 61L, 67L, 4L, 91L, 96L, 67L, 78L, 89L, 80L, 80L, 30L, 54L,
54L, 58L), contested_marks = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L,
0L, 2L, 1L, 0L, 1L, 3L, 0L, 0L, 0L, 1L, 0L, 0L), metres_gained = c(474L,
231L, 269L, 165L, 128L, 181L, 151L, 227L, -7L, 160L, 466L, 332L,
709L, 268L, 464L, 283L, 99L, 257L, 203L, 288L), turnovers = c(5L,
3L, 4L, 2L, 3L, 2L, 2L, 4L, 0L, 1L, 6L, 2L, 5L, 8L, 5L, 2L, 2L,
3L, 3L, 1L), effective_kicks = c(11L, 5L, 9L, 7L, 7L, 4L, 3L,
5L, 1L, 6L, 5L, 4L, 11L, 7L, 12L, 5L, 2L, 6L, 1L, 9L), ground_ball_gets = c(8L,
2L, 4L, 5L, 7L, 4L, 4L, 8L, 0L, 3L, 6L, 9L, 0L, 4L, 3L, 2L, 2L,
2L, 5L, 3L), cum_rec = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), rank_match_kicks = c(2, 34,
10.5, 20.5, 20.5, 20.5, 28, 20.5, 43, 28, 4.5, 20.5, 4.5, 10.5,
8, 20.5, 39.5, 20.5, 34, 20.5), rank_match_marks = c(14, 39,
5, 39, 33, 27.5, 33, 33, 43.5, 20.5, 20.5, 39, 14, 5, 5, 20.5,
33, 10, 27.5, 20.5)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
I find that processing grouped dataframes in dplyer can really slows things down, I'm not sure if it's the best workaround but when I finish grouping I pipe in
%>% as.data.frame()
to get rid of the grouping information, and then do my calculations afterward. It can save a lot of time. If you've previously grouped a large dataset give that a try.
CodePudding user response:
You could try this:
library(data.table)
setDT(data)
data[,paste0(vars, "_avge_last2_"):= lapply(.SD, frollmean, n=2),
.SDcols=vars,
by=.(fruit)
]