I have a large set (many, actually) of temperature data from laboratory fire experiments looking at temperature profiles at various distances from the fire source. These output files look something like this:
Time_series | Temperature_1 | Temperature_2 | Temperature_3 | Temperature_4 |
---|---|---|---|---|
0 | 977.1874 | 843.6411 | 962.6087 | 720.8003 |
0.002 | 973.9924 | 840.3609 | 960.572 | 724.4845 |
0.004 | 970.8031 | 837.1157 | 957.9192 | 727.7389 |
but with many more temperature measurements and time series. What I need to do is capture the mean, minimum, maximum, and intermittency of each column in a dataframe for each of the ~200 different data files. The first three I have figured out using the summarize_all(df, mean) command. Intermittency, however, requires doing something similar to this with a custom function, which in this case is finding the percentage of total temperature values that are greater than 500. So in Excel to calculate this I would use something like
=COUNTIF(B2:B46001, ">500")/COUNT(B2:B46001)
where I divide the count of values >500 by the count of all values in each column. Being more of an Excel person I'm wondering what is the best way to do this in R? Otherwise it's a long manual process to go through each file...
Also, if anyone has any advice on how to run this through on multiple dataframes using some sort of loop, so that I end up with a full dataframe of all the values I need, that would be even better! Each dataframe has an identifier for the experiment # (from C2700 to C2890) and I'm not sure if there's a way to cycle through each of those and combine the results into one big dataframe...
As mentioned I have been able to summarize the mean, min, and max, but am stuck on how to summarize using a custom function. Currently I have something simple like this:
stats_mean <- summarize_all(df1, mean)
stats_min <- summarize_all(df1, min)
stats_max <- summarize_all(df1, max)
stats_intermit <- ???
stats <- rbind(stats_mean, stats_max, stats_min, stats_intermit)
I don't yet know enough about loops to easily determine the best way to run this for multiple dataframes, so any advice on that would also be appreciated!
CodePudding user response:
Put your data.frames into a list and use lapply
to go through each. As for your intermittency problem, you could try something along the lines of
xy <- data.frame(matrix(1:9, nrow = 3))
apply(xy[, 2:3], MARGIN = 2, FUN = function(x) sum(x > 5) / length(x) )
X2 X3
0.3333333 1.0000000
CodePudding user response:
Try this combination for all of it in one go (using bind_rows
). We can run multiple functions using across
, where a named list of functions is an easy way to see things:
(FYI: I changed on value in quux
so that we'd have at least one value not above 500.)
quux <- structure(list(Time_series = c(0, 0.002, 0.004), Temperature_1 = c(977.1874, 473, 970.8031), Temperature_2 = c(843.6411, 840.3609, 837.1157), Temperature_3 = c(962.6087, 960.572, 957.9192), Temperature_4 = c(720.8003, 724.4845, 727.7389)), row.names = c(NA, -3L), class = "data.frame")
quuxs <- list(A=quux, B=quux, C=quux)
library(dplyr)
bind_rows(quuxs, .id = "id") %>%
group_by(id) %>%
summarize(across(everything(), list(
mean = ~ mean(.), min = ~ min(.), max = ~ max(.),
interm = ~ mean(. > 500)))
)
# # A tibble: 3 × 21
# id Time_seri…¹ Time_…² Time_…³ Time_…⁴ Tempe…⁵ Tempe…⁶ Tempe…⁷ Tempe…⁸ Tempe…⁹ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 0.002 0 0.004 0 807. 473 977. 0.667 840. 837. 844. 1 960. 958. 963.
# 2 B 0.002 0 0.004 0 807. 473 977. 0.667 840. 837. 844. 1 960. 958. 963.
# 3 C 0.002 0 0.004 0 807. 473 977. 0.667 840. 837. 844. 1 960. 958. 963.
# # … with 5 more variables: Temperature_3_interm <dbl>, Temperature_4_mean <dbl>, Temperature_4_min <dbl>, Temperature_4_max <dbl>,
# # Temperature_4_interm <dbl>, and abbreviated variable names ¹Time_series_mean, ²Time_series_min, ³Time_series_max,
# # ⁴Time_series_interm, ⁵Temperature_1_mean, ⁶Temperature_1_min, ⁷Temperature_1_max, ⁸Temperature_1_interm, ⁹Temperature_2_mean,
# # ˟Temperature_2_min, ˟Temperature_2_max, ˟Temperature_2_interm, ˟Temperature_3_mean, ˟Temperature_3_min, ˟Temperature_3_max
(You may want/need to include na.rm=TRUE
for each of the min/max/sum functions.)
Admittedly that's a bit noisy, but the names are clear:
... %>% names()
# names
# [1] "id" "Time_series_mean" "Time_series_min" "Time_series_max" "Time_series_interm"
# [6] "Temperature_1_mean" "Temperature_1_min" "Temperature_1_max" "Temperature_1_interm" "Temperature_2_mean"
# [11] "Temperature_2_min" "Temperature_2_max" "Temperature_2_interm" "Temperature_3_mean" "Temperature_3_min"
# [16] "Temperature_3_max" "Temperature_3_interm" "Temperature_4_mean" "Temperature_4_min" "Temperature_4_max"
# [21] "Temperature_4_interm"
If you want that a bit more compact, perhaps we can pivot it:
library(tidyr) # pivot_longer
bind_rows(quuxs, .id = "id") %>%
group_by(id) %>%
summarize(across(everything(), list(mean = ~ mean(.), min = ~ min(.), max = ~ max(.), interm = ~ mean(. > 500)))) %>%
pivot_longer(-id, names_pattern = "(.*)_([^_]*)$", names_to = c("temp", ".value"))
# # A tibble: 15 × 6
# id temp mean min max interm
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A Time_series 0.002 0 0.004 0
# 2 A Temperature_1 807. 473 977. 0.667
# 3 A Temperature_2 840. 837. 844. 1
# 4 A Temperature_3 960. 958. 963. 1
# 5 A Temperature_4 724. 721. 728. 1
# 6 B Time_series 0.002 0 0.004 0
# 7 B Temperature_1 807. 473 977. 0.667
# 8 B Temperature_2 840. 837. 844. 1
# 9 B Temperature_3 960. 958. 963. 1
# 10 B Temperature_4 724. 721. 728. 1
# 11 C Time_series 0.002 0 0.004 0
# 12 C Temperature_1 807. 473 977. 0.667
# 13 C Temperature_2 840. 837. 844. 1
# 14 C Temperature_3 960. 958. 963. 1
# 15 C Temperature_4 724. 721. 728. 1
or if you prefer a tranposed format,
bind_rows(quuxs, .id = "id") %>%
group_by(id) %>%
summarize(across(everything(), list(mean = ~ mean(.), min = ~ min(.), max = ~ max(.), interm = ~ mean(. > 500)))) %>%
pivot_longer(-id, names_pattern = "(.*)_([^_]*)$", names_to = c(".value", "stat"))
# # A tibble: 12 × 7
# id stat Time_series Temperature_1 Temperature_2 Temperature_3 Temperature_4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A mean 0.002 807. 840. 960. 724.
# 2 A min 0 473 837. 958. 721.
# 3 A max 0.004 977. 844. 963. 728.
# 4 A interm 0 0.667 1 1 1
# 5 B mean 0.002 807. 840. 960. 724.
# 6 B min 0 473 837. 958. 721.
# 7 B max 0.004 977. 844. 963. 728.
# 8 B interm 0 0.667 1 1 1
# 9 C mean 0.002 807. 840. 960. 724.
# 10 C min 0 473 837. 958. 721.
# 11 C max 0.004 977. 844. 963. 728.
# 12 C interm 0 0.667 1 1 1
(the only change was to names_to=
).