Calculations with columns of different row-length: Calculate the percentage abbreviation from a refe-CodePudding

I am quite new to R and got stuck with my first self written function. I have a dataset with multiple compounds from leaf extracts from different days. Here some example data.

day	values	compound
1	67	Arginine
1	68	Arginine
1	72	Arginine
2	63	Arginine
2	57	Arginine
3	65	Arginine
3	64	Arginine
1	370	Proline
1	355	Proline
2	400	Proline
3	450	Proline
1	800	Citric acid
1	756	Citric acid
2	780	Citric acid
3	600	Citric acid

I want to calculate the abbreviation for each value from the day 1 mean as a percentage. That means I want the mean from all day 1 values to be my reference value.

percentage <- function(x){
  mean1 <- mean(my_data$values[my_data$compound==x & my_data$day=="1"])
  percent <- ((my_data$values[my_data$compound==x] - mean1) / mean1) * 100
  return(as.numeric(percent))
}

The function seems to work for single compounds, but I can not manage to operate it for all at once and when I want to add it as a new column to the data frame as so,

my_data$abbreviation <- percentage("Arginine")

I get an Error message, because the number of rows are unequal. Does anybody have an idea how I can get past this issue? Do I have to rewrite the function and if yes, how?

Thank you in advance!

my_data <- 
  tibble::tribble(
  ~day, ~values,     ~compound,
    1L,     67L,    "Arginine",
    1L,     68L,    "Arginine",
    1L,     72L,    "Arginine",
    2L,     63L,    "Arginine",
    2L,     57L,    "Arginine",
    3L,     65L,    "Arginine",
    3L,     64L,    "Arginine",
    1L,    370L,     "Proline",
    1L,    355L,     "Proline",
    2L,    400L,     "Proline",
    3L,    450L,     "Proline",
    1L,    800L, "Citric acid",
    1L,    756L, "Citric acid",
    2L,    780L, "Citric acid",
    3L,    600L, "Citric acid"
  )

CodePudding user response：

This solution uses dplyr::group_by() for grouping the data first by compound and day. We can then calculate the mean for each day, set it to NA for each row where day does not equal 1 (dplyr::mutate()). After ungrouping the data we can tidyr::fill() in the reference column. Finally, we calculate the percent column.

library(dplyr)
library(tidyr)

my_data |> 
  group_by(compound, day) |> 
  mutate(reference = mean(values),
         reference = ifelse(day == 1, reference, NA)) |> 
  ungroup() |> 
  fill(reference) |> 
  mutate(percent = values * 100 / reference)
#> # A tibble: 15 × 5
#>      day values compound    reference percent
#>    <int>  <int> <chr>           <dbl>   <dbl>
#>  1     1     67 Arginine          69     97.1
#>  2     1     68 Arginine          69     98.6
#>  3     1     72 Arginine          69    104. 
#>  4     2     63 Arginine          69     91.3
#>  5     2     57 Arginine          69     82.6
#>  6     3     65 Arginine          69     94.2
#>  7     3     64 Arginine          69     92.8
#>  8     1    370 Proline          362.   102. 
#>  9     1    355 Proline          362.    97.9
#> 10     2    400 Proline          362.   110. 
#> 11     3    450 Proline          362.   124. 
#> 12     1    800 Citric acid      778    103. 
#> 13     1    756 Citric acid      778     97.2
#> 14     2    780 Citric acid      778    100. 
#> 15     3    600 Citric acid      778     77.1

If you want to wrap this into a function, that could look like this:

percentage <- 
  function(data) {
    data |> 
      group_by(compound, day) |> 
      mutate(reference = mean(values),
             reference = ifelse(day == 1, reference, NA)) |> 
      ungroup() |> 
      fill(reference) |> 
      mutate(percent = values * 100 / reference)
  }

And then apply it to my_data:

percentage(my_data)
#> # A tibble: 15 × 5
#>      day values compound    reference percent
#>    <int>  <int> <chr>           <dbl>   <dbl>
#>  1     1     67 Arginine          69     97.1
#>  2     1     68 Arginine          69     98.6
#>  3     1     72 Arginine          69    104. 
#>  4     2     63 Arginine          69     91.3
#>  5     2     57 Arginine          69     82.6
#>  6     3     65 Arginine          69     94.2
#>  7     3     64 Arginine          69     92.8
#>  8     1    370 Proline          362.   102. 
#>  9     1    355 Proline          362.    97.9
#> 10     2    400 Proline          362.   110. 
#> 11     3    450 Proline          362.   124. 
#> 12     1    800 Citric acid      778    103. 
#> 13     1    756 Citric acid      778     97.2
#> 14     2    780 Citric acid      778    100. 
#> 15     3    600 Citric acid      778     77.1

CodePudding user response：

You can put your function body directly in summarise after a group_by, and make two changes

Don't need to use mydata$ before column names, summarise is a "data masking" function
Don't need to subset to a specific compound with compound == x, the data is already grouped

library(dplyr, warn.conflicts = F)

my_data %>% 
  group_by(compound) %>% 
  mutate(percentage = {
    mean1 <- mean(values[day == "1"])
    ((values - mean1) / mean1) * 100
  })
#> # A tibble: 15 × 4
#> # Groups:   compound [3]
#>      day values compound    percentage
#>    <int>  <int> <chr>            <dbl>
#>  1     1     67 Arginine        -2.90 
#>  2     1     68 Arginine        -1.45 
#>  3     1     72 Arginine         4.35 
#>  4     2     63 Arginine        -8.70 
#>  5     2     57 Arginine       -17.4  
#>  6     3     65 Arginine        -5.80 
#>  7     3     64 Arginine        -7.25 
#>  8     1    370 Proline          2.07 
#>  9     1    355 Proline         -2.07 
#> 10     2    400 Proline         10.3  
#> 11     3    450 Proline         24.1  
#> 12     1    800 Citric acid      2.83 
#> 13     1    756 Citric acid     -2.83 
#> 14     2    780 Citric acid      0.257
#> 15     3    600 Citric acid    -22.9

^{Created on 2022-03-02 by the reprex package (v2.0.1)}