group_by() and summarise() calculates sd but returns NA for mean-CodePudding

I am trying to calculate the mean and standard deviation from a dataframe grouped by time and certain variables.

There is no problem when I perform the following code with a made-up data frame:

library(tidyverse)
library(reshape2)

experiment <- c("ex1","ex1","ex1","ex1","ex1","ex1","ex1","ex1","ex2","ex2","ex2","ex2","ex2","ex2","ex2","ex2")
time <- c("1","1","1","1","2","2","2","2","1","1","1","1","2","2","2","2")  
bottle <- c("1","2","3","4","1","2","3","4","1","2","3","4","1","2","3","4")
var1 <- runif(16, min=2, max=50)
var2 <- runif(16, min=2, max=50)
var3 <- runif(16, min=2, max=50)
df <- data.frame(experiment,time,bottle,var1,var2,var3)

df_melt <- melt(df, id = c("experiment", "time","bottle"))

df_avg <- df_melt %>% 
  filter(experiment %in% c("ex2")) %>%
  filter(bottle %in% c(1,2,3)) %>% 
  filter(variable %in% c("var1","var2")) %>%  
  group_by(time, variable) %>%
  summarise(mean = mean(value), sd= sd(value))

The above gives me the mean and sd no problem.

However, when I perform the exact same code but using my real data, it calculates the standard deviation but return all NA for mean. I know sharing data file is not encouraged on Stack Overflow, but I am really spinning my wheels here.

Here is my data file, and the code is the following:

data <- read.csv("path to data file",fileEncoding="UTF-8-BOM")
data_melted <- melt(data, id = c("Experiment", "Time", "Bottle"))

data_avg <- data_melted %>% 
  filter(Experiment %in% c("Vienn_U-A")) %>%
  filter(Bottle %in% c(1,2,4,5)) %>% 
  filter(variable %in% c("NH4","NO2","Urea")) %>%  
  group_by(Time, variable) %>%
  summarise(mean = mean(value), sd= sd(value))

This is what the output looks like:

> data_avg
# A tibble: 54 x 4
# Groups:   Time [18]
    Time variable  mean     sd
   <dbl> <fct>    <dbl>  <dbl>
 1   0   NH4         NA  0.750
 2   0   Urea        NA 10.4  
 3   0   NO2         NA  0.243
 4  24.6 NH4         NA  0.539
 5  24.6 Urea        NA  8.87 
 6  24.6 NO2         NA  0.590
 7  48.2 NH4         NA  0.477
 8  48.2 Urea        NA  8.97 
 9  48.2 NO2         NA  1.21 
10  77.9 NH4         NA  0.899
# ... with 44 more rows

Why does it calculate the sd, but returns NA for mean? Any guidance would be really appreciated!

CodePudding user response：

Inspired by akrun's answer, I was able to resolve this by working with .xlsx file and avoiding .csv file. It was not a problem with having NA values.

Here is the data file in .xlsx format, and the following code worked:

data <- read_excel("CSV for plotting.xlsx") #following akrun's suggestion 
data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
data_avg <- data_melted %>% 
  filter(Experiment %in% c("Vienn_U-A")) %>%
  filter(Bottle %in% c(1,2,4,5)) %>% 
  filter(variable %in% c("NH4","NO2","Urea")) %>%  
  group_by(Time, variable) %>%
  summarise(mean = mean(value), sd= sd(value))

Output with the correct mean values:

    Time variable   mean     sd
   <dbl> <fct>     <dbl>  <dbl>
 1   0   NH4        4.77  0.750
 2   0   Urea     281.   10.4  
 3   0   NO2       12.3   0.243
 4  24.6 NH4        9.11  0.539
 5  24.6 Urea     275.    8.87 
 6  24.6 NO2       17.8   0.590
 7  48.2 NH4       13.6   0.477
 8  48.2 Urea     268.    8.97 
 9  48.2 NO2       27.9   1.21 
10  77.9 NH4        8.75  0.899
# ... with 44 more rows

As mentioned above, working with .csv file still gives the problem of returning NA for the mean.

Here is the same data file but in .csv format

data <- read.csv("CSV for plotting.csv",fileEncoding="UTF-8-BOM")
data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
data_avg <- data_melted %>% 
  filter(Experiment %in% c("Vienn_U-A")) %>%
  filter(Bottle %in% c(1,2,4,5)) %>% 
  filter(variable %in% c("NH4","NO2","Urea")) %>%  
  group_by(Time, variable) %>%
  summarise(mean = mean(value), sd= sd(value))

Output with NA for the mean but not sd:

 Time variable  mean     sd
   <dbl> <fct>    <dbl>  <dbl>
 1   0   NH4         NA  0.750
 2   0   Urea        NA 10.4  
 3   0   NO2         NA  0.243
 4  24.6 NH4         NA  0.539
 5  24.6 Urea        NA  8.87 
 6  24.6 NO2         NA  0.590
 7  48.2 NH4         NA  0.477
 8  48.2 Urea        NA  8.97 
 9  48.2 NO2         NA  1.21 
10  77.9 NH4         NA  0.899
# ... with 44 more rows

It still baffles me why .csv file has this problem, but at least there is a solution.

CodePudding user response：

Based on the new link, the issue is how one of the column types were changed while reading with read.csv and not in read_excel i.e. column 'TN' is character class in read.csv whereas it is numeric

> data <- read.csv("CSV for plotting.csv", na.strings = "")
> str(data)
'data.frame':   292 obs. of  9 variables:
 $ Experiment: chr  "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
 $ Time      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Bottle    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ NH4       : num  550 551 546 555 551 ...
 $ Urea      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NO2       : num  18.6 18.3 18.6 18.1 18.9 ...
 $ NO3       : logi  NA NA NA NA NA NA ...
 $ Ln.NO2    : num  2.92 2.91 2.92 2.9 2.94 ...
 $ TN        : chr  "568.5714286" "569" "564.2857143" "573.1428571" ...

This is partly due to some elements that are read as #REF

[81] "913.8095238" "870.1428571" "914.1190476" "946.9761905" "903.5238095" "962.6428571" "965.2619048" "982.3809524" "909.1904762" "#REF!"

and the same line is NA while reading with read_excel

[85]  903.5238  962.6429  965.2619  982.3810  909.1905        NA

This causes major problems with melt as the columns specified are "Experiment", "Time","Bottle" and it implies that the other column when melted will have a single type, but because it is character in one class, the whole column converted to character class

> str(data_melted)
'data.frame':   1752 obs. of  5 variables:
 $ Experiment: chr  "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
 $ Time      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Bottle    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ variable  : Factor w/ 6 levels "NH4","Urea","NO2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value     : chr  "550" "550.7142857" "545.7142857" "555" ...

i.e. the value column is the one of interest here, which is now a different class than expected for mean which is not a problem for sd as it gets converted

> sd(c("15", "25", "35"))
[1] 10
>  sd(as.numeric(c("15", "25", "35"))) # same output
[1] 10
> mean(c("15", "25", "35"))
[1] NA

The reason is that according to the documentation of ?sd

x- a numeric vector or an R object but not a factor coercible to numeric by as.double(x).

i.e. it is converting internally by as.double and the function confirms it

> sd
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))

melt does a type conversion without throwing error. Suppose, if we use pivot_longer (from tidyr) it would raise an error and we would know the issue

> tidyr::pivot_longer(data, cols = NH4:TN)
Error: Can't combine `NH4` <double> and `TN` <character>.
Run `rlang::last_error()` to see where the error occurred.

Within the context of reading with read.csv an option will be to specify the #REF as NA elements

> data <- read.csv("CSV for plotting.csv", na.strings = c("", "#REF!"))
> str(data)
'data.frame':   292 obs. of  9 variables:
 $ Experiment: chr  "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
 $ Time      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Bottle    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ NH4       : num  550 551 546 555 551 ...
 $ Urea      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NO2       : num  18.6 18.3 18.6 18.1 18.9 ...
 $ NO3       : logi  NA NA NA NA NA NA ...
 $ Ln.NO2    : num  2.92 2.91 2.92 2.9 2.94 ...
 $ TN        : num  569 569 564 573 570 ...

Now, the melt would work correctly

> data_melted <- melt(data, id = c("Experiment", "Time","Bottle"))
> str(data_melted)
'data.frame':   1752 obs. of  5 variables:
 $ Experiment: chr  "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" "Vienn_A-U" ...
 $ Time      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Bottle    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ variable  : Factor w/ 6 levels "NH4","Urea","NO2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value     : num  550 551 546 555 551 ...

So would the mean

> data_avg <- data_melted %>% 
    filter(Experiment %in% c("Vienn_U-A")) %>%
    filter(Bottle %in% c(1,2,4,5)) %>% 
    filter(variable %in% c("NH4","NO2","Urea")) %>%  
    group_by(Time, variable) %>%
    summarise(mean = mean(value), sd= sd(value)) 
`summarise()` has grouped output by 'Time'. You can override using the `.groups` argument.
> data_avg
# A tibble: 54 × 4
# Groups:   Time [18]
    Time variable   mean     sd
   <dbl> <fct>     <dbl>  <dbl>
 1   0   NH4        4.77  0.750
 2   0   Urea     281.   10.4  
 3   0   NO2       12.3   0.243
 4  24.6 NH4        9.11  0.539
 5  24.6 Urea     275.    8.87 
 6  24.6 NO2       17.8   0.590
 7  48.2 NH4       13.6   0.477
 8  48.2 Urea     268.    8.97 
 9  48.2 NO2       27.9   1.21 
10  77.9 NH4        8.75  0.899
# … with 44 more rows

If there are NA values, consider using na.rm = TRUE in the mean and sd. By default, those are set as FALSE. In case, the mean from base is masked, use the ::. Also, the original link seems to be excel file. So, read the file with read_excel

library(dplyr)
data_avg <- data_melted %>% 
  filter(Experiment %in% c("Vienn_U-A")) %>%
  filter(Bottle %in% c(1,2,4,5)) %>% 
  filter(variable %in% c("NH4","NO2","Urea")) %>%  
  group_by(Time, variable) %>%
  summarise(mean = base::mean(value, na.rm = TRUE), sd= sd(value, na.rm = TRUE))

-output

> data_avg
# A tibble: 54 × 4
# Groups:   Time [18]
    Time variable   mean     sd
   <dbl> <fct>     <dbl>  <dbl>
 1   0   NH4        4.77  0.750
 2   0   Urea     281.   10.4  
 3   0   NO2       12.3   0.243
 4  24.6 NH4        9.11  0.539
 5  24.6 Urea     275.    8.87 
 6  24.6 NO2       17.8   0.590
 7  48.2 NH4       13.6   0.477
 8  48.2 Urea     268.    8.97 
 9  48.2 NO2       27.9   1.21 
10  77.9 NH4        8.75  0.899
# … with 44 more rows

The file in the link shows excel data. It may be better to read with packages that read the excel data

library(readxl)
data <- read_excel("CSV for plotting.xlsx")