I am quite new to R and got stuck with my first self written function. I have a dataset with multiple compounds from leaf extracts from different days. Here some example data.
day | values | compound |
---|---|---|
1 | 67 | Arginine |
1 | 68 | Arginine |
1 | 72 | Arginine |
2 | 63 | Arginine |
2 | 57 | Arginine |
3 | 65 | Arginine |
3 | 64 | Arginine |
1 | 370 | Proline |
1 | 355 | Proline |
2 | 400 | Proline |
3 | 450 | Proline |
1 | 800 | Citric acid |
1 | 756 | Citric acid |
2 | 780 | Citric acid |
3 | 600 | Citric acid |
I want to calculate the abbreviation for each value from the day 1 mean as a percentage. That means I want the mean from all day 1 values to be my reference value.
percentage <- function(x){
mean1 <- mean(my_data$values[my_data$compound==x & my_data$day=="1"])
percent <- ((my_data$values[my_data$compound==x] - mean1) / mean1) * 100
return(as.numeric(percent))
}
The function seems to work for single compounds, but I can not manage to operate it for all at once and when I want to add it as a new column to the data frame as so,
my_data$abbreviation <- percentage("Arginine")
I get an Error message, because the number of rows are unequal. Does anybody have an idea how I can get past this issue? Do I have to rewrite the function and if yes, how?
Thank you in advance!
my_data <-
tibble::tribble(
~day, ~values, ~compound,
1L, 67L, "Arginine",
1L, 68L, "Arginine",
1L, 72L, "Arginine",
2L, 63L, "Arginine",
2L, 57L, "Arginine",
3L, 65L, "Arginine",
3L, 64L, "Arginine",
1L, 370L, "Proline",
1L, 355L, "Proline",
2L, 400L, "Proline",
3L, 450L, "Proline",
1L, 800L, "Citric acid",
1L, 756L, "Citric acid",
2L, 780L, "Citric acid",
3L, 600L, "Citric acid"
)
CodePudding user response:
This solution uses dplyr::group_by()
for grouping the data first by compound
and day
. We can then calculate the mean for each day, set it to NA
for each row where day
does not equal 1
(dplyr::mutate()
). After ungrouping the data we can tidyr::fill()
in the reference
column. Finally, we calculate the percent
column.
library(dplyr)
library(tidyr)
my_data |>
group_by(compound, day) |>
mutate(reference = mean(values),
reference = ifelse(day == 1, reference, NA)) |>
ungroup() |>
fill(reference) |>
mutate(percent = values * 100 / reference)
#> # A tibble: 15 × 5
#> day values compound reference percent
#> <int> <int> <chr> <dbl> <dbl>
#> 1 1 67 Arginine 69 97.1
#> 2 1 68 Arginine 69 98.6
#> 3 1 72 Arginine 69 104.
#> 4 2 63 Arginine 69 91.3
#> 5 2 57 Arginine 69 82.6
#> 6 3 65 Arginine 69 94.2
#> 7 3 64 Arginine 69 92.8
#> 8 1 370 Proline 362. 102.
#> 9 1 355 Proline 362. 97.9
#> 10 2 400 Proline 362. 110.
#> 11 3 450 Proline 362. 124.
#> 12 1 800 Citric acid 778 103.
#> 13 1 756 Citric acid 778 97.2
#> 14 2 780 Citric acid 778 100.
#> 15 3 600 Citric acid 778 77.1
If you want to wrap this into a function, that could look like this:
percentage <-
function(data) {
data |>
group_by(compound, day) |>
mutate(reference = mean(values),
reference = ifelse(day == 1, reference, NA)) |>
ungroup() |>
fill(reference) |>
mutate(percent = values * 100 / reference)
}
And then apply it to my_data
:
percentage(my_data)
#> # A tibble: 15 × 5
#> day values compound reference percent
#> <int> <int> <chr> <dbl> <dbl>
#> 1 1 67 Arginine 69 97.1
#> 2 1 68 Arginine 69 98.6
#> 3 1 72 Arginine 69 104.
#> 4 2 63 Arginine 69 91.3
#> 5 2 57 Arginine 69 82.6
#> 6 3 65 Arginine 69 94.2
#> 7 3 64 Arginine 69 92.8
#> 8 1 370 Proline 362. 102.
#> 9 1 355 Proline 362. 97.9
#> 10 2 400 Proline 362. 110.
#> 11 3 450 Proline 362. 124.
#> 12 1 800 Citric acid 778 103.
#> 13 1 756 Citric acid 778 97.2
#> 14 2 780 Citric acid 778 100.
#> 15 3 600 Citric acid 778 77.1
CodePudding user response:
You can put your function body directly in summarise
after a group_by
, and make two changes
- Don't need to use
mydata$
before column names,summarise
is a "data masking" function - Don't need to subset to a specific compound with
compound == x
, the data is already grouped
library(dplyr, warn.conflicts = F)
my_data %>%
group_by(compound) %>%
mutate(percentage = {
mean1 <- mean(values[day == "1"])
((values - mean1) / mean1) * 100
})
#> # A tibble: 15 × 4
#> # Groups: compound [3]
#> day values compound percentage
#> <int> <int> <chr> <dbl>
#> 1 1 67 Arginine -2.90
#> 2 1 68 Arginine -1.45
#> 3 1 72 Arginine 4.35
#> 4 2 63 Arginine -8.70
#> 5 2 57 Arginine -17.4
#> 6 3 65 Arginine -5.80
#> 7 3 64 Arginine -7.25
#> 8 1 370 Proline 2.07
#> 9 1 355 Proline -2.07
#> 10 2 400 Proline 10.3
#> 11 3 450 Proline 24.1
#> 12 1 800 Citric acid 2.83
#> 13 1 756 Citric acid -2.83
#> 14 2 780 Citric acid 0.257
#> 15 3 600 Citric acid -22.9
Created on 2022-03-02 by the reprex package (v2.0.1)