Unable to use if else inside multiple filters in R

Time:07-05

I am trying to filter my dataset based on multiple parameters within a loop. Below is a sample dataset:

library(dplyr)  # provides %>%, mutate() and filter()

set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
n <- length(dates)
df <- data.frame(Timestamp = dates,
                  topic = rep(LETTERS[1:2], n/2),
                  value = sample(1:10, n, replace=TRUE))

df <- df %>% mutate(Year = lubridate::year(Timestamp))

And the following is the code I am using to filter my data:

df_list <- list() #to save the output
i = 1

for (cat in c('A', 'B')){
  for (year in c('2020', '2021')){
    
    df_list[i] <- df %>% 
      filter(topic == cat) %>%
      filter(Year == year) %>% 
      {if(year == '2020'){
        filter(value < 5)
      } else {filter(value > 5)}
      }
  }
}

But it gives me the following error:

Error in filter(value < 5) : object 'value' not found

And if I remove the { before if and its closing bracket, it gives me the following error:

Error in if (.) year == "2020" else { : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) year == "2020" else { :
  the condition has length > 1 and only the first element will be used

Since I have a large dataset with multiple topics and years, and with different conditions like filter(value < 5), I am trying to save the results in a list so that I can later store them as separate data frames.

Can someone please help me solve the above error(s)?

Answer:

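The version with braces fails because, inside { }, the pipe no longer inserts the data as the first argument, so filter(value < 5) is called without a data frame and value cannot be found. You would have to pass the piped data explicitly as ., e.g. filter(., value < 5). A simpler option is to take the if statement out of the pipe and do something like this inside your loop: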

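# Subset by topic and year first, then apply the year-specific condition
df_sub <- filter(df, topic == cat, Year == year)
if (year == '2020') {
   df_list[[i]] <- filter(df_sub, value < 5)
} else {
   df_list[[i]] <- filter(df_sub, value > 5)
}
i <- i + 1  # move to the next list slot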
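
For reference, an if/else can also stay inside the pipe, provided each branch passes the piped data on explicitly as the . placeholder. The following is only a minimal sketch, assuming dplyr is attached; the toy data frame and the fixed year value are made up for illustration and are not from the question.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(topic = c("A", "A", "B", "B"),
                  value = c(2, 8, 3, 9))
year <- "2020"

res <- toy %>%
  filter(topic == "A") %>%
  {
    # Inside braces the piped data frame is only available as `.`
    if (year == "2020") filter(., value < 5) else filter(., value > 5)
  }
```

Here res should hold the topic A rows that satisfy the branch chosen by year. Whether this reads better than an if/else outside the pipe is largely a matter of taste.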

However you do not need a loop for this at all. Using the data you defined you can do:

# Filter dataframe to rows you want to keep
df <- subset(df, (Year == 2020 & value < 5) | (Year != 2020 & value > 5))

# Create a list with every combination of year/topic
split(df, list(df$Year, df$topic))
# $`2020.A`
#    Timestamp topic value Year
# 1 2020-12-30     A     1 2020

# $`2021.A`
#     Timestamp topic value Year
# 5  2021-01-03     A    10 2021
# 11 2021-01-09     A     7 2021
# 13 2021-01-11     A     9 2021
# 19 2021-01-17     A     9 2021
# 31 2021-01-29     A     8 2021

# $`2020.B`
# [1] Timestamp topic     value     Year
# <0 rows> (or 0-length row.names)

# $`2021.B`
#     Timestamp topic value Year
# 4  2021-01-02     B     9 2021
# 8  2021-01-06     B    10 2021
# 10 2021-01-08     B     8 2021
# 16 2021-01-14     B    10 2021
# 20 2021-01-18     B     9 2021
# 26 2021-01-24     B     8 2021
# 28 2021-01-26     B    10 2021
# 30 2021-01-28     B    10 2021
# 32 2021-01-30     B     6 2021

EDIT: I realised I had not used your seed - updated with the same data as in the question.

Answer:

Another solution, giving a list of data frames as requested, and retaining the possibility of looping over other stuff if necessary. There may be a way to avoid having to reset the data frame for the "inner loop".

df_temp <- NULL  # start empty; starting from NA would add a row of NAs
for (cat in c('A', 'B')){
  df_temp <- rbind(df_temp, 
                   df %>% 
                     filter(topic == cat) %>%
                     filter((Year == '2020' & value < 5) | 
                              (Year == '2021' & value > 5)))
}
df_list[[i]] <- df_temp
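
If one data frame per topic/year combination is wanted, the nested loop from the question can be kept with the if/else moved outside the pipe. This is a sketch only, assuming dplyr is attached and rebuilding the sample data from the question; the names sub and the "year.topic" list names are my own choices, not something the question specifies.

```r
library(dplyr)

# Sample data as in the question (Year derived with base R here)
set.seed(42)
dates <- seq.Date(as.Date("2020-12-30"), as.Date("2021-01-30"), "day")
df <- data.frame(Timestamp = dates,
                 topic = rep(LETTERS[1:2], length(dates) / 2),
                 value = sample(1:10, length(dates), replace = TRUE))
df$Year <- as.integer(format(df$Timestamp, "%Y"))

df_list <- list()
for (cat in c("A", "B")) {
  for (year in c(2020, 2021)) {
    sub <- df %>% filter(topic == cat, Year == year)
    # Year-specific rule; extend this if/else for other conditions
    sub <- if (year == 2020) filter(sub, value < 5) else filter(sub, value > 5)
    # Store under a name such as "2020.A" (naming scheme is an assumption)
    df_list[[paste(year, cat, sep = ".")]] <- sub
  }
}
```

Naming the list elements avoids tracking a separate counter i, and a combination with no matching rows is still stored as an empty data frame.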