Home > OS >  How to apply custom function to multiple columns in R and store many dataframes using loops
How to apply custom function to multiple columns in R and store many dataframes using loops

Time:12-17

I have a large set (many, actually) of temperature data from laboratory fire experiments looking at temperature profiles at various distances from the fire source. These output files look something like this:

Time_series Temperature_1 Temperature_2 Temperature_3 Temperature_4
0 977.1874 843.6411 962.6087 720.8003
0.002 973.9924 840.3609 960.572 724.4845
0.004 970.8031 837.1157 957.9192 727.7389

but with many more temperature measurements and time series. What I need to do is capture the mean, minimum, maximum, and intermittency of each column in a dataframe for each of the ~200 different data files. The first three I have figured out using the summarize_all(df, mean) command. Intermittency, however, requires doing something similar to this with a custom function, which in this case is finding the percentage of total temperature values that are greater than 500. So in Excel to calculate this I would use something like

=COUNTIF(B2:B46001, ">500")/COUNT(B2:B46001)

where I divide the count of values >500 by the count of all values in each column. Being more of an Excel person I'm wondering what is the best way to do this in R? Otherwise it's a long manual process to go through each file...

Also, if anyone has any advice on how to run this through on multiple dataframes using some sort of loop, so that I end up with a full dataframe of all the values I need, that would be even better! Each dataframe has an identifier for the experiment # (from C2700 to C2890) and I'm not sure if there's a way to cycle through each of those and combine the results into one big dataframe...

As mentioned I have been able to summarize the mean, min, and max, but am stuck on how to summarize using a custom function. Currently I have something simple like this:

stats_mean <- summarize_all(df1, mean)
stats_min <- summarize_all(df1, min)
stats_max <- summarize_all(df1, max)
stats_intermit <- ???

stats <- rbind(stats_mean, stats_max, stats_min, stats_intermit)

I don't yet know enough about loops to easily determine the best way to run this for multiple dataframes, so any advice on that would also be appreciated!

CodePudding user response:

Put your data.frames into a list and use lapply to go through each. As for your intermittency problem, you could try something along the lines of

xy <- data.frame(matrix(1:9, nrow = 3))
apply(xy[, 2:3], MARGIN = 2, FUN = function(x) sum(x > 5) / length(x) )

       X2        X3 
0.3333333 1.0000000

CodePudding user response:

Try this combination for all of it in one go (using bind_rows). We can run multiple functions using across, where a named list of functions is an easy way to see things:

(FYI: I changed on value in quux so that we'd have at least one value not above 500.)

quux <- structure(list(Time_series = c(0, 0.002, 0.004), Temperature_1 = c(977.1874, 473, 970.8031), Temperature_2 = c(843.6411, 840.3609, 837.1157), Temperature_3 = c(962.6087, 960.572, 957.9192), Temperature_4 = c(720.8003, 724.4845, 727.7389)), row.names = c(NA, -3L), class = "data.frame")
quuxs <- list(A=quux, B=quux, C=quux)

library(dplyr)
bind_rows(quuxs, .id = "id") %>%
  group_by(id) %>%
  summarize(across(everything(), list(
    mean = ~ mean(.), min = ~ min(.), max = ~ max(.),
    interm = ~ mean(. > 500)))
  )
# # A tibble: 3 × 21
#   id    Time_seri…¹ Time_…² Time_…³ Time_…⁴ Tempe…⁵ Tempe…⁶ Tempe…⁷ Tempe…⁸ Tempe…⁹ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟ Tempe…˟
#   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1 A           0.002       0   0.004       0    807.     473    977.   0.667    840.    837.    844.       1    960.    958.    963.
# 2 B           0.002       0   0.004       0    807.     473    977.   0.667    840.    837.    844.       1    960.    958.    963.
# 3 C           0.002       0   0.004       0    807.     473    977.   0.667    840.    837.    844.       1    960.    958.    963.
# # … with 5 more variables: Temperature_3_interm <dbl>, Temperature_4_mean <dbl>, Temperature_4_min <dbl>, Temperature_4_max <dbl>,
# #   Temperature_4_interm <dbl>, and abbreviated variable names ¹​Time_series_mean, ²​Time_series_min, ³​Time_series_max,
# #   ⁴​Time_series_interm, ⁵​Temperature_1_mean, ⁶​Temperature_1_min, ⁷​Temperature_1_max, ⁸​Temperature_1_interm, ⁹​Temperature_2_mean,
# #   ˟​Temperature_2_min, ˟​Temperature_2_max, ˟​Temperature_2_interm, ˟​Temperature_3_mean, ˟​Temperature_3_min, ˟​Temperature_3_max

(You may want/need to include na.rm=TRUE for each of the min/max/sum functions.)

Admittedly that's a bit noisy, but the names are clear:

... %>% names()
# names
#  [1] "id"                   "Time_series_mean"     "Time_series_min"      "Time_series_max"      "Time_series_interm"  
#  [6] "Temperature_1_mean"   "Temperature_1_min"    "Temperature_1_max"    "Temperature_1_interm" "Temperature_2_mean"  
# [11] "Temperature_2_min"    "Temperature_2_max"    "Temperature_2_interm" "Temperature_3_mean"   "Temperature_3_min"   
# [16] "Temperature_3_max"    "Temperature_3_interm" "Temperature_4_mean"   "Temperature_4_min"    "Temperature_4_max"   
# [21] "Temperature_4_interm"

If you want that a bit more compact, perhaps we can pivot it:

library(tidyr) # pivot_longer
bind_rows(quuxs, .id = "id") %>%
  group_by(id) %>%
  summarize(across(everything(), list(mean = ~ mean(.), min = ~ min(.), max = ~ max(.), interm = ~ mean(. > 500)))) %>%
  pivot_longer(-id, names_pattern = "(.*)_([^_]*)$", names_to = c("temp", ".value"))
# # A tibble: 15 × 6
#    id    temp             mean   min     max interm
#    <chr> <chr>           <dbl> <dbl>   <dbl>  <dbl>
#  1 A     Time_series     0.002    0    0.004  0    
#  2 A     Temperature_1 807.     473  977.     0.667
#  3 A     Temperature_2 840.     837. 844.     1    
#  4 A     Temperature_3 960.     958. 963.     1    
#  5 A     Temperature_4 724.     721. 728.     1    
#  6 B     Time_series     0.002    0    0.004  0    
#  7 B     Temperature_1 807.     473  977.     0.667
#  8 B     Temperature_2 840.     837. 844.     1    
#  9 B     Temperature_3 960.     958. 963.     1    
# 10 B     Temperature_4 724.     721. 728.     1    
# 11 C     Time_series     0.002    0    0.004  0    
# 12 C     Temperature_1 807.     473  977.     0.667
# 13 C     Temperature_2 840.     837. 844.     1    
# 14 C     Temperature_3 960.     958. 963.     1    
# 15 C     Temperature_4 724.     721. 728.     1    

or if you prefer a tranposed format,

bind_rows(quuxs, .id = "id") %>%
  group_by(id) %>%
  summarize(across(everything(), list(mean = ~ mean(.), min = ~ min(.), max = ~ max(.), interm = ~ mean(. > 500)))) %>%
  pivot_longer(-id, names_pattern = "(.*)_([^_]*)$", names_to = c(".value", "stat"))
# # A tibble: 12 × 7
#    id    stat   Time_series Temperature_1 Temperature_2 Temperature_3 Temperature_4
#    <chr> <chr>        <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
#  1 A     mean         0.002       807.             840.          960.          724.
#  2 A     min          0           473              837.          958.          721.
#  3 A     max          0.004       977.             844.          963.          728.
#  4 A     interm       0             0.667            1             1             1 
#  5 B     mean         0.002       807.             840.          960.          724.
#  6 B     min          0           473              837.          958.          721.
#  7 B     max          0.004       977.             844.          963.          728.
#  8 B     interm       0             0.667            1             1             1 
#  9 C     mean         0.002       807.             840.          960.          724.
# 10 C     min          0           473              837.          958.          721.
# 11 C     max          0.004       977.             844.          963.          728.
# 12 C     interm       0             0.667            1             1             1 

(the only change was to names_to=).

  • Related