R: if statement inside function (lapply)


I have a large list of dataframes with environmental variables from different localities. For each of the dataframes in the list, I want to summarize the values across locality (= group measurements of the same locality into one), using the name of the dataframes as a condition for which variables need to be summarized. For example, for a dataframe with the name 'salinity' I want to only summarize across salinity, and not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.

Let's do this with a dummy dataset:

#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                 Temp = c(14, 15, 16, 18, 20, 18, 21),
                 Sal = c(16, NA, NA, 12, NA, NA, 9))

df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
                 Temp = c(1, 2, 4, 5, 0, 2, -1),
                 Sal = c(18, NA, NA, NA, 36, NA, NA))

df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                 Temp = c(14, NA, NA, NA, 17, 18, 21),
                 Sal = c(16, 8, 24, 23, 11, 12, 9))

df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
                 Temp = c(1, NA, NA, NA, NA, 0, 2),
                 Sal = c(18, 17, 13, 16, 20, 36, 30))

df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
                   "Summer_salinity", "Winter_salinity")

Next, I used lapply to summarize environmental variables:

#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]  
df_temp = df_list[grep("temperature", names(df_list))]  

#use lapply to summarize salinity or temperature values in each dataframe
library(dplyr)  #provides %>%, group_by() and summarise()
df_sal2 = lapply(df_sal, function(x) {
  x %>%
    group_by(locality) %>%
    summarise(Sal = mean(Sal, na.rm = TRUE))
})
df_temp2 = lapply(df_temp, function(x) {
  x %>%
    group_by(locality) %>%
    summarise(Temp = mean(Temp, na.rm = TRUE))
})

Now, this code is repetitive, so I want to downsize this by combining everything into one function. This is what I tried:

df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
  if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})

But I am getting the following output:





And the following warning messages:

Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used

Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I will have to add many more if statements - for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.


I tried the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return and assigning the output to x, but both have the same problem as the code below. Any ideas?

#else if
df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Sal = mean(Sal, na.rm = TRUE))}
  else if (grepl("temperature", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Temp = mean(Temp, na.rm = TRUE))}
})

What I think is happening is that my if argument does not get passed to the summarize function, so nothing is being summarized.

Several things are going on here, including:

  1. As akrun said, an if statement needs a condition of length 1. Yours is not: names() returns one entry per column, so grepl() returns one logical value per column.

    grepl("locality", names(df1))

    That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool for checking column names, since it matches substrings: a column named notlocality would also match, and the code would then error further down. I suggest you change to

    "locality" %in% names(df1)
    # [1] TRUE
  2. You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but if neither condition is met the function returns NULL. I suggest one of the following: either add a final } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anonymous function with just x (to return it).
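Both points can be checked in isolation. Below is a minimal base R sketch; df_demo, no_else and with_else are hypothetical names made up for illustration, and the frame is a cut-down stand-in for the data in the question.

```r
# Stand-in frame with the same columns as the question's data (hypothetical)
df_demo <- data.frame(locality = c(1, 2), Temp = c(14, 15), Sal = c(16, NA))

# grepl() is vectorised: it returns one logical per column name,
# so it is not a valid length-1 condition for if()
cond_vec <- grepl("Sal", names(df_demo))
length(cond_vec)

# %in% with a single value on the left always gives exactly one TRUE/FALSE
cond_one <- "Sal" %in% names(df_demo)

# An if() whose condition is FALSE and that has no else evaluates to NULL,
# so a function ending in such an if() returns NULL
no_else <- function(x) {
  if ("NoSuchColumn" %in% names(x)) x
}

# Ending the chain with `else x` guarantees that something is returned
with_else <- function(x) {
  if ("NoSuchColumn" %in% names(x)) {
    x[0, ]
  } else x
}

res_null <- no_else(df_demo)
res_same <- with_else(df_demo)
```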

However, I think ultimately the problem is that you are looking for "temperature" or "salinity" which are in the names of the list-objects, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.
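To see the difference between the two sets of names, here is a small base R sketch. demo_list is a hypothetical two-element list, not the df_list from the question.

```r
# Hypothetical named list of small frames
demo_list <- list(
  Summer_salinity    = data.frame(locality = 1, Sal = 16),
  Winter_temperature = data.frame(locality = 1, Temp = 2)
)

# Inside lapply(), x is one data frame, so names(x) gives its column names
inner_names <- lapply(demo_list, function(x) names(x))

# The list names live on the list itself and are not passed to the function
outer_names <- names(demo_list)

inner_names$Summer_salinity
outer_names
```

This is why the names have to be handed to the function as a second argument, for example with Map.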

I think this is what you want?

Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Temp = mean(Temp, na.rm = TRUE))
  } else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1  14  
# 2        2  15.5
# 3        5  18  
# 4        7  19  
# 5        9  21  
# $Winter_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1   1.5
# 2        3   4  
# 3        6   5  
# 4        8   0  
# 5        9   0.5
# $Summer_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1  16  
# 2        3   8  
# 3        4  23.5
# 4        5  11.5
# 5        9   9  
# $Winter_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1    16
# 2        4    16
# 3        7    20
# 4        8    36
# 5       10    30
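
Since you mention that the final data will have more than two variables, one possible way to avoid a growing if/else chain is a lookup from keyword to column name. This is only a sketch under assumptions: var_lookup, summarise_by_name and demo_list are hypothetical names, each list name is assumed to contain exactly one keyword, and base R aggregate() is swapped in for the dplyr pipeline so the example has no package dependencies.

```r
# Hypothetical lookup: keyword expected in the list name -> column to average
var_lookup <- c(salinity = "Sal", temperature = "Temp")

summarise_by_name <- function(x, nm) {
  # Which keywords occur in this list name? One TRUE/FALSE per keyword.
  is_hit <- vapply(names(var_lookup),
                   function(k) grepl(k, nm, fixed = TRUE),
                   logical(1))
  hit <- names(var_lookup)[is_hit]

  # No keyword, or more than one: return the frame unchanged
  if (length(hit) != 1) return(x)

  col <- var_lookup[[hit]]
  # Mean of the chosen column per locality, ignoring NA values
  aggregate(x[col], by = x["locality"], FUN = mean, na.rm = TRUE)
}

# Small stand-in list (not the full data from the question)
demo_list <- list(
  Summer_temperature = data.frame(locality = c(1, 2, 2),
                                  Temp = c(14, 15, 16),
                                  Sal = c(16, NA, NA)),
  Winter_salinity = data.frame(locality = c(1, 1, 4),
                               Temp = c(1, NA, NA),
                               Sal = c(18, 17, 16))
)

res <- Map(summarise_by_name, demo_list, names(demo_list))
res
```

Adding a new variable then only means adding one entry to var_lookup. Note that this returns plain data frames rather than tibbles.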
