I'm trying to apply a function to a large dataset. Specifically, for each row I want the mean of the lowest 1000 times (df$time) set before the date (df$date) found in that row. Applying this function to a small portion of the data worked.
However, because the dataset is so large, I want to restrict the apply to just the 1% of rows where df$wr is TRUE.
This is the code I wrote so far with mean1000 as the intended name of the new variable and the data set split based on name (25 categories):
df1 <- data.frame(
mean1000 = lapply(
split(df, df$name), function(y)
df$y$mean1000 = apply(y, 1, function(x) {ifelse(x["wr" == TRUE],
mean(sort(df$time[df$date < x["date"]])[2:1000]), NA)})) %>%
unlist()
)
Result:
df1 is created, but it's just a table with 0 observations of 1 variable (mean1000)
The error message is 25 times the following:
1. Unknown or uninitialised column `y`.
I mostly followed the guidelines as outlined here, but those solutions are less complex/layered than what I'm trying to do. How can I adjust the code?
Data:
| # | time | date | id1 | id2 | rank | name | wr |
|---|------|-----------|-----|-----|------|-------|------|
| 1 | 2408 | 2022-06-04| a8m2| pr9w| 24 | City01| TRUE |
| 2 | 2503 | 2022-06-25| b6p5| ur1r| 226 | City01| FALSE|
| 3 | 2672 | 2022-05-07| c8k1| py5l| 371 | City01| FALSE|
The desired result is an extra column that holds the calculated mean (mean(sort(df$time[df$date < x["date"]])[2:1000])) for rows where wr is TRUE, and NA otherwise.
Edit to show the adjustments made to Parfait's answer:
# SORT DATA BY NAME AND DATE
df <- with(df, df[order(name, date),]) |> `row.names<-`(NULL)
df <- as.vector(df)
# CONDITIONALLY CALCULATE MEAN BY GROUP
df$mean1000 <- by(df, df$name, function(sub) {
# ITERATE THROUGH EVERY DATE ROW WHILE CONDITIONALLY ADJUSTING BY wr FLAG
mean1000 <- ifelse(sub$wr == TRUE, sapply(
sub$date,
# SUBSET AND CALCULATE MEAN
FUN=\(dt) mean(sub$time[sub$date< dt][2:1000], na.rm=TRUE)
), NA_real_)
})
# CONVERT VECTOR BACK TO DATA FRAME AND RENAME COLUMN
df <- data.frame(df$id1, df$id2, df$id3, df$time, df$date, df$rank, df$name, df$wr, as.numeric(unlist(df$mean1000)))
colnames(df) <- c('id1', 'id2', 'id3', 'time', 'date', 'rank', 'name', 'wr', 'mean1000')
Answer:
Consider `by` (an object-oriented wrapper to `tapply`), which is very similar to `split` + `lapply` but more streamlined. Then run an embedded `sapply` for the row-wise conditional mean calculations.
# SORT DATA BY NAME AND DATE (df1 IS THE INPUT DATA FRAME, df IN THE QUESTION)
df1 <- with(df1, df1[order(name, date),]) |> `row.names<-`(NULL)
# CONDITIONALLY CALCULATE MEAN BY GROUP
df1$mean1000 <- by(df1, df1$name, function(sub) {
  # ITERATE THROUGH EVERY DATE ROW
  mean1000 <- sapply(
    sub$date,
    # SUBSET AND CALCULATE MEAN
    FUN=\(dt) mean(sub$time[sub$date < dt][2:1000], na.rm=TRUE)
  )
  # CONDITIONALLY ADJUST BY wr FLAG
  ifelse(sub$wr == TRUE, mean1000, NA_real_)
}) |> unlist(use.names = FALSE)  # FLATTEN PER-GROUP RESULTS INTO ONE VECTOR IN ROW ORDER
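Since the question asks to avoid doing the work for the 99% of rows where wr is FALSE, here is a minimal sketch of a variant that only evaluates the mean for the flagged rows instead of computing it everywhere and masking afterwards. The toy data and the helper name `mean_lowest_before` are hypothetical. The helper also sorts the earlier times and averages the lowest `n` of them, which is an assumption about the intent ("lowest 1000 times") and differs from the `[2:1000]` indexing used above.

```r
# Toy data modelled on the question's table (values are made up for illustration)
df <- data.frame(
  time = c(2408, 2503, 2672, 2390, 2600, 2450),
  date = as.Date(c("2022-06-04", "2022-06-25", "2022-05-07",
                   "2022-07-02", "2022-05-01", "2022-06-10")),
  name = c("City01", "City01", "City01", "City01", "City02", "City02"),
  wr   = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: mean of the n lowest times set strictly before `dt`
mean_lowest_before <- function(dt, dates, times, n = 1000) {
  earlier <- sort(times[dates < dt])
  if (length(earlier) == 0) return(NA_real_)  # nothing was set before this date
  mean(head(earlier, n))
}

# Start with NA everywhere, then fill in only the rows flagged by wr
df$mean1000 <- NA_real_
for (idx in split(seq_len(nrow(df)), df$name)) {
  sub <- df[idx, ]               # all rows of one name group
  flagged <- idx[which(sub$wr)]  # row numbers in df that need a value
  df$mean1000[flagged] <- vapply(
    flagged,
    function(i) mean_lowest_before(df$date[i], sub$date, sub$time),
    numeric(1)
  )
}
```

Because the loop works on row indices, the data does not need to be sorted first and the result lands directly in the right rows. Whether this is actually faster on the real data depends on group sizes, so it is worth timing both versions on a sample.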