R: if statement inside function (lapply)

Time:05-25

I have a large list of dataframes with environmental variables from different localities. For each dataframe in the list, I want to summarize the values per locality (i.e. collapse repeated measurements from the same locality into one row), using the name of the dataframe to decide which variable gets summarized. For example, for a dataframe whose name contains 'salinity' I want to summarize only salinity, and not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.

Let's do this with a dummy dataset:

#load dplyr for %>%, group_by() and summarise()
library(dplyr)

#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                 Temp = c(14, 15, 16, 18, 20, 18, 21),
                 Sal = c(16, NA, NA, 12, NA, NA, 9))

df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
                 Temp = c(1, 2, 4, 5, 0, 2, -1),
                 Sal = c(18, NA, NA, NA, 36, NA, NA))

df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
                 Temp = c(14, NA, NA, NA, 17, 18, 21),
                 Sal = c(16, 8, 24, 23, 11, 12, 9))

df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
                 Temp = c(1, NA, NA, NA, NA, 0, 2),
                 Sal = c(18, 17, 13, 16, 20, 36, 30))

df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
                   "Summer_salinity", "Winter_salinity")

Next, I used lapply to summarize environmental variables:

#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]  
df_temp = df_list[grep("temperature", names(df_list))]  

#use lapply to summarize salinity or temperature values in each dataframe
##salinity
df_sal2 = lapply(df_sal, function(x) {
      x %>%
        group_by(locality) %>% 
        summarise(Sal = mean(Sal, na.rm = TRUE)) 
    })
        
##temperature
df_temp2 = lapply(df_temp, function(x) {
      x %>%
        group_by(locality) %>% 
        summarise(Temp = mean(Temp, na.rm = TRUE)) 
    })

Now, this code is repetitive, so I want to shorten it by combining everything into one function. This is what I tried:

df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
  if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
  })

But I am getting the following output:

$Summer_temperature
NULL

$Winter_temperature
NULL

$Summer_salinity
NULL

$Winter_salinity
NULL

And the following warning messages:

Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
  the condition has length > 1 and only the first element will be used

Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I will have to add many more if statements - for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.

EDIT:

I tried using the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return and assigning the output to x, but both have the same problem as the code below. Any ideas?

#else if
df_env = lapply(df_list, function(x) {
  if (grepl("salinity", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Sal = mean(Sal, na.rm = TRUE))}
  else if (grepl("temperature", names(x)) == TRUE) {
    x %>% group_by(locality) %>% 
      summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env

What I think is happening is that my if argument does not get passed to the summarize function, so nothing is being summarized.

Answer:

Several things are going on here, including:

  1. As akrun said, the condition of an if statement must have length 1. Yours do not.

    grepl("locality", names(df1))
    # [1]  TRUE FALSE FALSE
    

    That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to

    "locality" %in% names(df1)
    # [1] TRUE
    
  2. You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but if neither condition is met, nothing is returned (an if whose condition is FALSE and that has no else evaluates to NULL, which is why every element of your result is NULL). I suggest one of the following: either add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anon-func with just x (to return it).
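
As a small base-R illustration of point 1, the sketch below shows why a vectorised grepl() call is not a usable if condition and two ways to get a single TRUE/FALSE. The vector cols is a stand-in for names(df1), written out by hand here. Note that the question shows a warning, whereas from R 4.2.0 onward a condition of length greater than one is an error.

```r
# Stand-in for names(df1) from the question (written out for illustration)
cols <- c("locality", "Temp", "Sal")

# grepl() is vectorised: one result per column name,
# so its output is not a valid `if` condition on its own
length(grepl("Sal", cols))  # 3

# Option 1: collapse the vector explicitly
has_sal_any <- any(grepl("Sal", cols))

# Option 2: test for an exact column name, which always yields one TRUE/FALSE
has_sal_in <- "Sal" %in% cols

if (has_sal_in) message("Sal column found")
```

The %in% form also avoids partial matches, which is the concern raised above about a column such as notlocality.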

However, I think ultimately the problem is that you are looking for "temperature" or "salinity", which are in the names of the list elements, not in the frames themselves. lapply passes only each element to the function, not its name, so your reference to names(x) returns c("locality", "Temp", "Sal"), the column names of the frame x itself.
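
To make that concrete, here is a minimal base-R sketch. demo_list is a made-up two-element list whose names mimic those in the question; it is not the original data.

```r
# Made-up stand-in list; the element names mimic those in the question
demo_list <- list(
  Summer_temperature = data.frame(locality = 1, Temp = 14, Sal = 16),
  Winter_salinity    = data.frame(locality = 1, Temp = 1,  Sal = 18)
)

# Inside lapply(), the function only receives the data frame,
# so names(x) are its column names, never "Summer_temperature" etc.
seen_by_lapply <- lapply(demo_list, function(x) names(x))

# Iterating over the names instead keeps the element name available
seen_by_name <- sapply(names(demo_list), function(nm) grepl("salinity", nm))
```

Here seen_by_lapply holds the same three column names for every element, while seen_by_name is a logical vector with one entry per list element.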

I think this is what you want?

Map(function(x, nm) {
  if (grepl("salinity", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Sal = mean(Sal, na.rm = TRUE))
  } else if (grepl("temperature", nm)) {
    x %>%
      group_by(locality) %>%
      summarize(Temp = mean(Temp, na.rm = TRUE))
  } else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1  14  
# 2        2  15.5
# 3        5  18  
# 4        7  19  
# 5        9  21  
# $Winter_temperature
# # A tibble: 5 x 2
#   locality  Temp
#      <dbl> <dbl>
# 1        1   1.5
# 2        3   4  
# 3        6   5  
# 4        8   0  
# 5        9   0.5
# $Summer_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1  16  
# 2        3   8  
# 3        4  23.5
# 4        5  11.5
# 5        9   9  
# $Winter_salinity
# # A tibble: 5 x 2
#   locality   Sal
#      <dbl> <dbl>
# 1        1    16
# 2        4    16
# 3        7    20
# 4        8    36
# 5       10    30
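
Since the question mentions that the real data will have more than two environmental variables, one possible way to avoid a growing if/else chain is a lookup from keyword to column name. This is only a sketch under assumptions: var_lookup, summarise_by_name and demo_list are hypothetical names introduced here, and across() with all_of() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical lookup: keyword in the list-element name -> column to summarise
var_lookup <- c(salinity = "Sal", temperature = "Temp")

summarise_by_name <- function(x, nm) {
  # Which keywords occur in this element's name?
  found <- vapply(names(var_lookup), grepl, logical(1), x = nm, fixed = TRUE)
  hit <- names(var_lookup)[found]
  # No match, or more than one: return the frame unchanged
  if (length(hit) != 1) return(x)
  col <- var_lookup[[hit]]
  x %>%
    group_by(locality) %>%
    summarise(across(all_of(col), ~ mean(.x, na.rm = TRUE)))
}

# Made-up miniature of df_list, only to keep the sketch self-contained
demo_list <- list(
  Summer_salinity    = data.frame(locality = c(1, 1, 4), Temp = c(1, NA, NA), Sal = c(18, 17, 16)),
  Winter_temperature = data.frame(locality = c(1, 1, 3), Temp = c(1, 2, 4),   Sal = c(18, NA, NA))
)

res <- Map(summarise_by_name, demo_list, names(demo_list))
```

Adding another variable then means adding one entry to var_lookup rather than another branch. With the question's data, the call would be Map(summarise_by_name, df_list, names(df_list)).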