I have a large list of dataframes with environmental variables from different localities. For each of the dataframes in the list, I want to summarize the values across locality (= group measurements of the same locality into one), using the name of the dataframes as a condition for which variables need to be summarized. For example, for a dataframe with the name 'salinity' I want to only summarize across salinity, and not the other environmental variables. Note that the different dataframes contain data from different localities, so I cannot simply merge them into one dataframe.
Let's do this with a dummy dataset:
library(dplyr) # provides %>%, group_by() and summarise() used below
#create list of dataframes
df1 = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
Temp = c(14, 15, 16, 18, 20, 18, 21),
Sal = c(16, NA, NA, 12, NA, NA, 9))
df2 = data.frame(locality = c(1, 1, 3, 6, 8, 9, 9),
Temp = c(1, 2, 4, 5, 0, 2, -1),
Sal = c(18, NA, NA, NA, 36, NA, NA))
df3 = data.frame(locality = c(1, 3, 4, 4, 5, 5, 9),
Temp = c(14, NA, NA, NA, 17, 18, 21),
Sal = c(16, 8, 24, 23, 11, 12, 9))
df4 = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
Temp = c(1, NA, NA, NA, NA, 0, 2),
Sal = c(18, 17, 13, 16, 20, 36, 30))
df_list = list(df1, df2, df3, df4)
names(df_list) = c("Summer_temperature", "Winter_temperature",
"Summer_salinity", "Winter_salinity")
Next, I used lapply to summarize environmental variables:
#select only those dataframes in the list that have either 'salinity' or 'temperature' in the dataframe names
df_sal = df_list[grep("salinity", names(df_list))]
df_temp = df_list[grep("temperature", names(df_list))]
#use apply to summarize salinity or temperature values in each dataframe
##salinity
df_sal2 = lapply(df_sal, function(x) {
x %>%
group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))
})
##temperature
df_temp2 = lapply(df_temp, function(x) {
x %>%
group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))
})
Now, this code is repetitive, so I want to downsize this by combining everything into one function. This is what I tried:
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Sal = mean(Sal, na.rm = TRUE))}
if (grepl("temperature", names(x)) == TRUE) {x %>% group_by(locality) %>% summarise(Temp = mean(Temp, na.rm = TRUE))}
})
But I am getting the following output:
$Summer_temperature
NULL
$Winter_temperature
NULL
$Summer_salinity
NULL
$Winter_salinity
NULL
And the following warning messages:
Warning messages:
1: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
3: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
4: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
5: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
7: In if (grepl("salinity", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
8: In if (grepl("temperature", names(x)) == TRUE) { :
the condition has length > 1 and only the first element will be used
Now, I read here that this warning message can potentially be solved by using ifelse. However, in the final dataset I will have more than two environmental variables, so I will have to add many more if statements - for this reason I believe ifelse is not a solution here. Does anyone have an elegant solution to my problem? I am new to using both functions and lapply, and would appreciate any help you can give me.
EDIT:
I tried using the else if option suggested in one of the answers, but this still returns NULL values. I also tried using return and assigning the output to x, but both have the same problem as the code below. Any ideas?
#else if
df_env = lapply(df_list, function(x) {
if (grepl("salinity", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Sal = mean(Sal, na.rm = TRUE))}
else if (grepl("temperature", names(x)) == TRUE) {
x %>% group_by(locality) %>%
summarise(Temp = mean(Temp, na.rm = TRUE))}
})
df_env
What I think is happening is that my if argument does not get passed to the summarize function, so nothing is being summarized.
CodePudding user response:
Several things are going on here, including:
- As akrun said, if statements must have a condition with a length of 1. Yours do not:
grepl("locality", names(df1))
# [1] TRUE FALSE FALSE
That must be reduced so that it is always exactly length 1. Frankly, grepl is the wrong tool here, since technically a column named notlocality would match and then it would error. I suggest you change to:
"locality" %in% names(df1)
# [1] TRUE
- You need to return something. Always. You shifted from if ...; if ...; to if ... else if ..., which is a good start, but really if you meet neither condition, then nothing is returned. I suggest one of the following: either add one more } else x, or reassign as if (..) { x <- x %>% ...; } else if (..) { x <- x %>% ... ; } and then end the anon-func with just x (to return it).
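To make the two points above concrete, here is a minimal base R sketch. The helper pick_column() and the vector cols are hypothetical names made up for illustration; they are not part of the question's code. It shows a condition reduced to length 1 and a function that always returns a value through an explicit final else.

```r
# Hypothetical helper, for illustration only: decide from a single
# length-1 string which column should be summarised.
pick_column <- function(nm) {
  if (grepl("salinity", nm, fixed = TRUE)) {
    "Sal"
  } else if (grepl("temperature", nm, fixed = TRUE)) {
    "Temp"
  } else {
    NA_character_  # explicit fall-through, so the result is never NULL
  }
}

# Column names as they appear in the question's data frames.
cols <- c("locality", "Temp", "Sal")

n_cond   <- length(grepl("salinity", cols))  # 3 values: not a valid if() condition
any_hit  <- any(grepl("salinity", cols))     # collapsed to one logical value
has_site <- "locality" %in% cols             # exact match, one logical value

pick_column("Summer_salinity")
pick_column("Winter_temperature")
pick_column("Depth")
```

Whether to collapse with any() or to test with %in% depends on whether a partial or an exact match is wanted; %in% avoids accidental matches such as notlocality.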
However, I think ultimately the problem is that you are looking for "temperature" or "salinity", which are in the names of the list-objects, not in the frames themselves. For instance, your reference to names(x) is returning c("locality", "Temp", "Sal"), the names of the frame x itself.
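A small sketch of that difference, using a made-up two-element list (toy_list is a hypothetical name, not from the question): lapply() hands the function only the data frame, while Map() can be given the list names as a second argument.

```r
# Toy list for illustration; the element names mimic those in the question.
toy_list <- list(
  Summer_temperature = data.frame(locality = 1, Temp = 14, Sal = 16),
  Winter_salinity    = data.frame(locality = 1, Temp = 1,  Sal = 18)
)

# Inside lapply(), x is just the data frame, so names(x) gives its columns.
seen_by_lapply <- lapply(toy_list, function(x) names(x))

# With Map(), the list names are passed alongside each data frame as nm.
seen_by_map <- Map(function(x, nm) nm, toy_list, names(toy_list))

seen_by_lapply$Winter_salinity
seen_by_map$Winter_salinity
```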
I think this is what you want?
Map(function(x, nm) {
if (grepl("salinity", nm)) {
x %>%
group_by(locality) %>%
summarize(Sal = mean(Sal, na.rm = TRUE))
} else if (grepl("temperature", nm)) {
x %>%
group_by(locality) %>%
summarize(Temp = mean(Temp, na.rm = TRUE))
} else x
}, df_list, names(df_list))
# $Summer_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 14
# 2 2 15.5
# 3 5 18
# 4 7 19
# 5 9 21
# $Winter_temperature
# # A tibble: 5 x 2
# locality Temp
# <dbl> <dbl>
# 1 1 1.5
# 2 3 4
# 3 6 5
# 4 8 0
# 5 9 0.5
# $Summer_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 3 8
# 3 4 23.5
# 4 5 11.5
# 5 9 9
# $Winter_salinity
# # A tibble: 5 x 2
# locality Sal
# <dbl> <dbl>
# 1 1 16
# 2 4 16
# 3 7 20
# 4 8 36
# 5 10 30
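Since the question mentions that the real data will have more than two environmental variables, one possible way to avoid a growing if/else chain is a lookup from keyword to column name. The following is only a sketch under assumptions: var_lookup, summarise_by_name() and demo_list are hypothetical names, and it uses base R's aggregate() instead of the dplyr pipeline above so that it runs without extra packages.

```r
# Hypothetical lookup: keyword in the list name -> column to summarise.
var_lookup <- c(salinity = "Sal", temperature = "Temp")

summarise_by_name <- function(x, nm, lookup = var_lookup) {
  # Which keywords occur in this data frame's list name?
  is_hit <- vapply(names(lookup),
                   function(key) grepl(key, nm, fixed = TRUE),
                   logical(1))
  hit <- names(lookup)[is_hit]
  # No match or an ambiguous match: return the data frame unchanged.
  if (length(hit) != 1) return(x)
  col <- lookup[[hit]]
  # Per-locality mean of the chosen column, ignoring NA values.
  aggregate(x[col], by = x["locality"], FUN = mean, na.rm = TRUE)
}

# Two of the question's data frames, reproduced so the sketch is self-contained.
demo_list <- list(
  Summer_temperature = data.frame(locality = c(1, 2, 2, 5, 7, 7, 9),
                                  Temp = c(14, 15, 16, 18, 20, 18, 21),
                                  Sal = c(16, NA, NA, 12, NA, NA, 9)),
  Winter_salinity = data.frame(locality = c(1, 1, 1, 4, 7, 8, 10),
                               Temp = c(1, NA, NA, NA, NA, 0, 2),
                               Sal = c(18, 17, 13, 16, 20, 36, 30))
)

result <- Map(summarise_by_name, demo_list, names(demo_list))
result
```

Adding a new variable then only means adding one entry to var_lookup. The result is a list of plain data frames rather than tibbles.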