I am trying to filter my dataset based on multiple parameters within a loop. Below is a sample dataset:
library(dplyr)  # provides %>%, mutate() and filter()
set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
n <- length(dates)
df <- data.frame(Timestamp = dates,
                 topic = rep(LETTERS[1:2], n/2),
                 value = sample(1:10, n, replace=TRUE))
df <- df %>% mutate(Year = lubridate::year(Timestamp))
And following is the code I am using to filter my data:
df_list <- list() # to save the output
i = 1
for (cat in c('A', 'B')){
  for (year in c('2020', '2021')){
    df_list[i] <- df %>%
      filter(topic == cat) %>%
      filter(Year == year) %>%
      {if(year == '2020'){
        filter(value < 5)
      } else {filter(value > 5)}
      }
  }
}
But it gives me the following error:
Error in filter(value < 5) : object 'value' not found
And if I remove the { before if and its closing bracket, it gives me the following error:
Error in if (.) year == "2020" else { :
argument is not interpretable as logical
In addition: Warning message:
In if (.) year == "2020" else { :
the condition has length > 1 and only the first element will be used
Since I have a large dataset with multiple topics and years, and with different conditions in filter(value < 5), I am trying to save the results in a list so that I can later store them in separate data frames.
Can someone please help me solve the above error(s)?
CodePudding user response:
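Your syntax is not valid - you cannot put an if statement in the middle of a pipe like that. Inside the braces the piped data is not passed to filter() automatically, so filter(value < 5) is called without a data frame and R looks for an object called value. If you want to do it this way you should do something like:
# year is the loop variable; use [[i]] to store a whole data frame in the list
if (year == 2020) {
  df_list[[i]] <- filter(df, value < 5)
} else {
  df_list[[i]] <- filter(df, value > 5)
}
However, you do not need a loop for this at all. Using the data you defined you can do: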
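If you would rather keep the condition inside the pipe, a braced block can work as long as the piped data is handed to filter() explicitly as `.`. A minimal sketch, assuming the magrittr `%>%` pipe that dplyr re-exports and the sample data from the question; the fixed `cat` and `year` values and the name `out` are only for illustration, and Year is derived with format() here to avoid needing lubridate:

```r
library(dplyr)

# Sample data from the question
set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
n <- length(dates)
df <- data.frame(Timestamp = dates,
                 topic = rep(LETTERS[1:2], n / 2),
                 value = sample(1:10, n, replace = TRUE))
df$Year <- as.integer(format(df$Timestamp, "%Y"))

# Illustrative values; inside the loop these would come from for()
cat  <- "A"
year <- "2021"

out <- df %>%
  filter(topic == cat, Year == year) %>%
  {
    # Inside braces the piped data frame is only available as `.`,
    # so it has to be passed to filter() explicitly.
    if (year == "2020") filter(., value < 5) else filter(., value > 5)
  }
```

Inside the loop, the result would then be stored with `df_list[[i]] <- ...` followed by `i <- i + 1`.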
# Filter dataframe to rows you want to keep
df <- subset(df, (Year == 2020 & value < 5) | (Year != 2020 & value > 5))
# Create a list with every combination of year/topic
split(df, list(df$Year, df$topic))
# $`2020.A`
# Timestamp topic value Year
# 1 2020-12-30 A 1 2020
# $`2021.A`
# Timestamp topic value Year
# 5 2021-01-03 A 10 2021
# 11 2021-01-09 A 7 2021
# 13 2021-01-11 A 9 2021
# 19 2021-01-17 A 9 2021
# 31 2021-01-29 A 8 2021
# $`2020.B`
# [1] Timestamp topic value Year
# <0 rows> (or 0-length row.names)
# $`2021.B`
# Timestamp topic value Year
# 4 2021-01-02 B 9 2021
# 8 2021-01-06 B 10 2021
# 10 2021-01-08 B 8 2021
# 16 2021-01-14 B 10 2021
# 20 2021-01-18 B 9 2021
# 26 2021-01-24 B 8 2021
# 28 2021-01-26 B 10 2021
# 30 2021-01-28 B 10 2021
# 32 2021-01-30 B 6 2021
EDIT: I realised I had not used your seed - updated with the same data as in the question.
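Since the question mentions different conditions for different years, one way to extend the subset/split idea is to keep the per-year conditions in a small lookup. This is only a sketch in base R: the `rules` list, its two entries, and the names `keep` and `df_kept` are assumptions made for illustration, and Year is derived with format() instead of lubridate.

```r
# Sample data from the question
set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
n <- length(dates)
df <- data.frame(Timestamp = dates,
                 topic = rep(LETTERS[1:2], n / 2),
                 value = sample(1:10, n, replace = TRUE))
df$Year <- as.integer(format(df$Timestamp, "%Y"))

# Hypothetical lookup: one predicate on `value` per year
rules <- list(`2020` = function(v) v < 5,
              `2021` = function(v) v > 5)

# Apply each year's rule to that year's rows;
# rows from years without a rule stay FALSE and are dropped
keep <- logical(nrow(df))
for (y in names(rules)) {
  idx <- df$Year == y
  keep[idx] <- rules[[y]](df$value[idx])
}
df_kept <- df[keep, ]

# One data frame per Year/topic combination, as before
df_list <- split(df_kept, list(df_kept$Year, df_kept$topic))
```

Adding another year then only means adding one more entry to `rules`.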
CodePudding user response:
Another solution, giving a list of data frames as requested, and retaining the possibility of looping over other stuff if necessary. There may be a way to avoid having to reset the data frame for the "inner loop".
df_list <- list()
i <- 1
for (cat in c('A', 'B')){
  df_temp <- NULL  # reset for each topic; rbind(NULL, x) just returns x
  df_temp <- rbind(df_temp,
                   df %>%
                     filter(topic == cat) %>%
                     filter((Year == '2020' & value < 5) |
                            (Year == '2021' & value > 5)
                     )
  )
  df_list[[i]] <- df_temp
  i <- i + 1
}
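If a separate list element per topic and year is wanted, the same loop can be nested and the elements named instead of indexed. A rough sketch, assuming dplyr is loaded and using the question's sample data; the `key` naming scheme is just one possible choice, and Year is derived with format() to avoid needing lubridate:

```r
library(dplyr)

# Sample data from the question
set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
n <- length(dates)
df <- data.frame(Timestamp = dates,
                 topic = rep(LETTERS[1:2], n / 2),
                 value = sample(1:10, n, replace = TRUE))
df$Year <- as.integer(format(df$Timestamp, "%Y"))

df_list <- list()
for (cat in c("A", "B")) {
  for (year in c("2020", "2021")) {
    key <- paste(year, cat, sep = ".")  # e.g. "2020.A"
    df_list[[key]] <- df %>%
      filter(topic == cat, Year == year) %>%
      # The if() picks which comparison on `value` to apply for this year
      filter(if (year == "2020") value < 5 else value > 5)
  }
}

names(df_list)
```

Because the if() sits inside filter() rather than in a separate pipe step, `value` is looked up in the data frame as usual.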