I am trying to calculate the mean and standard deviation from a dataframe grouped by time and certain variables.
There is no problem when I perform the following code with a made-up data frame:
library(tidyverse)
library(reshape2)
experiment <- c("ex1","ex1","ex1","ex1","ex1","ex1","ex1","ex1","ex2","ex2","ex2","ex2","ex2","ex2","ex2","ex2")
time <- c("1","1","1","1","2","2","2","2","1","1","1","1","2","2","2","2")
bottle <- c("1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4")
var1 <- runif(16, min=2, max=50)
var2 <- runif(16, min=2, max=50)
var3 <- runif(16, min=2, max=50)
df <- data.frame(experiment,time,bottle,var1,var2,var3)
df_melt <- melt(df, id = c("experiment", "time","bottle"))
df_avg <- df_melt %>%
filter(experiment %in% c("ex2")) %>%
filter(bottle %in% c(1,2,3)) %>%
filter(variable %in% c("var1","var2")) %>%
group_by(time, variable) %>%
summarise(mean = mean(value), sd= sd(value))
The above gives me the mean and sd no problem.
However, when I perform the exact same code but using my real data, it calculates the standard deviation but return all NA for mean. I know sharing data file is not encouraged on Stack Overflow, but I am really spinning my wheels here.
Here is my data file, and the code is the following:
data <- read.csv("path to data file",fileEncoding="UTF-8-BOM")
data_melted <- melt(data, id = c("Experiment", "Time", "Bottle"))
data_avg <- data_melted %>%
filter(Experiment %in% c("Vienn_U-A")) %>%
filter(Bottle %in% c(1,2,4,5)) %>%
filter(variable %in% c("NH4","NO2","Urea")) %>%
group_by(Time, variable) %>%
summarise(mean = mean(value), sd= sd(value))
This is what the output looks like:
> data_avg
# A tibble: 54 x 4
# Groups: Time [18]
Time variable mean sd
<dbl> <fct> <dbl> <dbl>
1 0 NH4 NA 0.750
2 0 Urea NA 10.4
3 0 NO2 NA 0.243
4 24.6 NH4 NA 0.539
5 24.6 Urea NA 8.87
6 24.6 NO2 NA 0.590
7 48.2 NH4 NA 0.477
8 48.2 Urea NA 8.97
9 48.2 NO2 NA 1.21
10 77.9 NH4 NA 0.899
# ... with 44 more rows
Why does it calculate the sd, but returns NA for mean? Any guidance would be really appreciated!
CodePudding user response:
Inspired by akrun's answer, I was able to resolve this by working with .xlsx
file and avoiding .csv
file. It was not a problem with having NA
values.
Here is the data file in .xlsx format, and the following code worked:
data <- read_excel("CSV for plotting.xlsx") #following akrun's suggestion
data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
data_avg <- data_melted %>%
filter(Experiment %in% c("Vienn_U-A")) %>%
filter(Bottle %in% c(1,2,4,5)) %>%
filter(variable %in% c("NH4","NO2","Urea")) %>%
group_by(Time, variable) %>%
summarise(mean = mean(value), sd= sd(value))
Output with the correct mean
values:
Time variable mean sd
<dbl> <fct> <dbl> <dbl>
1 0 NH4 4.77 0.750
2 0 Urea 281. 10.4
3 0 NO2 12.3 0.243
4 24.6 NH4 9.11 0.539
5 24.6 Urea 275. 8.87
6 24.6 NO2 17.8 0.590
7 48.2 NH4 13.6 0.477
8 48.2 Urea 268. 8.97
9 48.2 NO2 27.9 1.21
10 77.9 NH4 8.75 0.899
# ... with 44 more rows
As mentioned above, working with .csv
file still gives the problem of returning NA for the mean
.
Here is the same data file but in .csv format
data <- read.csv("CSV for plotting.csv",fileEncoding="UTF-8-BOM")
data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
data_avg <- data_melted %>%
filter(Experiment %in% c("Vienn_U-A")) %>%
filter(Bottle %in% c(1,2,4,5)) %>%
filter(variable %in% c("NH4","NO2","Urea")) %>%
group_by(Time, variable) %>%
summarise(mean = mean(value), sd= sd(value))
Output with NA
for the mean
but not sd
:
Time variable mean sd
<dbl> <fct> <dbl> <dbl>
1 0 NH4 NA 0.750
2 0 Urea NA 10.4
3 0 NO2 NA 0.243
4 24.6 NH4 NA 0.539
5 24.6 Urea NA 8.87
6 24.6 NO2 NA 0.590
7 48.2 NH4 NA 0.477
8 48.2 Urea NA 8.97
9 48.2 NO2 NA 1.21
10 77.9 NH4 NA 0.899
# ... with 44 more rows
It still baffles me why .csv
file has this problem, but at least there is a solution.
CodePudding user response:
Based on the new link, the issue is how one of the column types were changed while reading with read.csv
and not in read_excel
i.e. column 'TN' is character
class in read.csv
whereas it is numeric
> data <- read.csv("CSV for plotting.csv", na.strings = "")
> str(data)
'data.frame': 292 obs. of 9 variables:
$ Experiment: chr "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
$ Time : num 0 0 0 0 0 0 0 0 0 0 ...
$ Bottle : int 1 2 3 4 5 6 7 8 9 10 ...
$ NH4 : num 550 551 546 555 551 ...
$ Urea : num 0 0 0 0 0 0 0 0 0 0 ...
$ NO2 : num 18.6 18.3 18.6 18.1 18.9 ...
$ NO3 : logi NA NA NA NA NA NA ...
$ Ln.NO2 : num 2.92 2.91 2.92 2.9 2.94 ...
$ TN : chr "568.5714286" "569" "564.2857143" "573.1428571" ...
This is partly due to some elements that are read as #REF
[81] "913.8095238" "870.1428571" "914.1190476" "946.9761905" "903.5238095" "962.6428571" "965.2619048" "982.3809524" "909.1904762" "#REF!"
and the same line is NA
while reading with read_excel
[85] 903.5238 962.6429 965.2619 982.3810 909.1905 NA
This causes major problems with melt
as the columns specified are "Experiment", "Time","Bottle"
and it implies that the other column when melt
ed will have a single type, but because it is character
in one class, the whole column converted to character
class
> str(data_melted)
'data.frame': 1752 obs. of 5 variables:
$ Experiment: chr "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
$ Time : num 0 0 0 0 0 0 0 0 0 0 ...
$ Bottle : int 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 6 levels "NH4","Urea","NO2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : chr "550" "550.7142857" "545.7142857" "555" ...
i.e. the value
column is the one of interest here, which is now a different class than expected for mean
which is not a problem for sd
as it gets converted
> sd(c("15", "25", "35"))
[1] 10
> sd(as.numeric(c("15", "25", "35"))) # same output
[1] 10
> mean(c("15", "25", "35"))
[1] NA
The reason is that according to the documentation of ?sd
x- a numeric vector or an R object but not a factor coercible to numeric by as.double(x).
i.e. it is converting internally by as.double
and the function confirms it
> sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
melt
does a type conversion without throwing error. Suppose, if we use pivot_longer
(from tidyr
) it would raise an error and we would know the issue
> tidyr::pivot_longer(data, cols = NH4:TN)
Error: Can't combine `NH4` <double> and `TN` <character>.
Run `rlang::last_error()` to see where the error occurred.
Within the context of reading with read.csv
an option will be to specify the #REF
as NA
elements
> data <- read.csv("CSV for plotting.csv", na.strings = c("", "#REF!"))
> str(data)
'data.frame': 292 obs. of 9 variables:
$ Experiment: chr "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
$ Time : num 0 0 0 0 0 0 0 0 0 0 ...
$ Bottle : int 1 2 3 4 5 6 7 8 9 10 ...
$ NH4 : num 550 551 546 555 551 ...
$ Urea : num 0 0 0 0 0 0 0 0 0 0 ...
$ NO2 : num 18.6 18.3 18.6 18.1 18.9 ...
$ NO3 : logi NA NA NA NA NA NA ...
$ Ln.NO2 : num 2.92 2.91 2.92 2.9 2.94 ...
$ TN : num 569 569 564 573 570 ...
Now, the melt
would work correctly
> data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
> str(data_melted)
'data.frame': 1752 obs. of 5 variables:
$ Experiment: chr "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
$ Time : num 0 0 0 0 0 0 0 0 0 0 ...
$ Bottle : int 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 6 levels "NH4","Urea","NO2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 550 551 546 555 551 ...
So would the mean
> data_avg <- data_melted %>%
filter(Experiment %in% c("Vienn_U-A")) %>%
filter(Bottle %in% c(1,2,4,5)) %>%
filter(variable %in% c("NH4","NO2","Urea")) %>%
group_by(Time, variable) %>%
summarise(mean = mean(value), sd= sd(value))
`summarise()` has grouped output by 'Time'. You can override using the `.groups` argument.
> data_avg
# A tibble: 54 × 4
# Groups: Time [18]
Time variable mean sd
<dbl> <fct> <dbl> <dbl>
1 0 NH4 4.77 0.750
2 0 Urea 281. 10.4
3 0 NO2 12.3 0.243
4 24.6 NH4 9.11 0.539
5 24.6 Urea 275. 8.87
6 24.6 NO2 17.8 0.590
7 48.2 NH4 13.6 0.477
8 48.2 Urea 268. 8.97
9 48.2 NO2 27.9 1.21
10 77.9 NH4 8.75 0.899
# … with 44 more rows
If there are NA
values, consider using na.rm = TRUE
in the mean
and sd
. By default, those are set as FALSE
. In case, the mean
from base
is masked, use the ::
. Also, the original link seems to be excel file. So, read the file with read_excel
library(dplyr)
data_avg <- data_melted %>%
filter(Experiment %in% c("Vienn_U-A")) %>%
filter(Bottle %in% c(1,2,4,5)) %>%
filter(variable %in% c("NH4","NO2","Urea")) %>%
group_by(Time, variable) %>%
summarise(mean = base::mean(value, na.rm = TRUE), sd= sd(value, na.rm = TRUE))
-output
> data_avg
# A tibble: 54 × 4
# Groups: Time [18]
Time variable mean sd
<dbl> <fct> <dbl> <dbl>
1 0 NH4 4.77 0.750
2 0 Urea 281. 10.4
3 0 NO2 12.3 0.243
4 24.6 NH4 9.11 0.539
5 24.6 Urea 275. 8.87
6 24.6 NO2 17.8 0.590
7 48.2 NH4 13.6 0.477
8 48.2 Urea 268. 8.97
9 48.2 NO2 27.9 1.21
10 77.9 NH4 8.75 0.899
# … with 44 more rows
The file in the link shows excel
data. It may be better to read with packages that read the excel data
library(readxl)
data <- read_excel("CSV for plotting.xlsx")