Home > Software engineering >  aggregate columns across table using wildcard in R
aggregate columns across table using wildcard in R

Time:09-15

I have an R dataframe like so:

TS                          Wafer(1)Radius(38)  Wafer(1)Radius(49)  Wafer(2)Radius(06)  Wafer(2)Radius(38)
2022-06-29T02:33:34.827582  699.565925              699.726346          700.526022          700.820165
2022-06-29T02:33:42.277582  700.053016              700.081032          700.558847          700.367924
2022-06-29T02:33:49.727582  699.707056              699.761278          700.485359          700.958339
2022-06-29T02:33:57.169972  699.909119              700.04703           700.419578          700.565945
2022-06-29T02:34:04.617582  699.74137               699.698093          700.397555          700.533587
2022-06-29T02:34:12.067582  700.106121              699.868698          700.573662          700.611651

I would like to get the min/max/mean for each Wafer grouped by each row. For example, for the first row/TS, the max for Wafer(1) would be 699.726346. The max for Wafer(2) would be 700.820165. This would continue down the rows for each TS.

I know there is a pmax() function that can find the max/min across multiple columns but how do I group it to wafer1, wafer 2 etc? Also I would like to apply other statistics such as mean median etc. Is this possible?

Expected result would be something like:

TS                          Wafer(1)Radius(38)      Wafer(1)Radius(49)      Wafer(2)Radius(06)      Wafer(2)Radius(38)          Wafer1_Max          wafer2_max      wafer1_min      wafer2_min      wafer1_mean     wafer2_mean
2022-06-29T02:33:34.827582      699.565925                  699.726346      700.526022                  700.820165              699.726346          700.820165      699.565925      700.526022      699.6461355     700.6730935
2022-06-29T02:33:42.277582      700.053016                  700.081032      700.558847                  700.367924              700.081032          700.558847      700.053016      700.367924      700.067024      700.4633855
2022-06-29T02:33:49.727582      699.707056                  699.761278      700.485359                  700.958339              699.761278          700.958339      699.707056      700.485359      699.734167      700.721849
2022-06-29T02:33:57.169972      699.909119                  700.04703       700.419578                  700.565945              700.04703           700.565945      699.909119      700.419578      699.9780745     700.4927615
2022-06-29T02:34:04.617582      699.74137                   699.698093      700.397555                  700.533587              699.74137           700.533587      699.698093      700.397555      699.7197315     700.465571
2022-06-29T02:34:12.067582      700.106121                  699.868698      700.573662                  700.611651              700.106121          700.611651      699.868698      700.573662      699.9874095     700.5926565

CodePudding user response:

If we pivot your data to a long format, this gets much easier as we can use standard grouping functions:

library(tidyr)
library(dplyr)
df %>%
  pivot_longer(
    -TS,
    names_pattern = "([0-9] ).*([0-9] )",
    names_to = c("Wafer", "Radius"),
    values_to = "value"
  ) %>%
  group_by(TS, Wafer) %>%
  summarize(
    max = max(value),
    min = min(value),
    mean = mean(value),
    median = median(value),
    .groups = "drop"
  ) %>%
  as.data.frame
#                            TS Wafer      max      min     mean   median
# 1  2022-06-29T02:33:34.827582     1 699.7263 699.5659 699.6461 699.6461
# 2  2022-06-29T02:33:34.827582     2 700.8202 700.5260 700.6731 700.6731
# 3  2022-06-29T02:33:42.277582     1 700.0810 700.0530 700.0670 700.0670
# 4  2022-06-29T02:33:42.277582     2 700.5588 700.3679 700.4634 700.4634
# 5  2022-06-29T02:33:49.727582     1 699.7613 699.7071 699.7342 699.7342
# 6  2022-06-29T02:33:49.727582     2 700.9583 700.4854 700.7218 700.7218
# 7  2022-06-29T02:33:57.169972     1 700.0470 699.9091 699.9781 699.9781
# 8  2022-06-29T02:33:57.169972     2 700.5659 700.4196 700.4928 700.4928
# 9  2022-06-29T02:34:04.617582     1 699.7414 699.6981 699.7197 699.7197
# 10 2022-06-29T02:34:04.617582     2 700.5336 700.3976 700.4656 700.4656
# 11 2022-06-29T02:34:12.067582     1 700.1061 699.8687 699.9874 699.9874
# 12 2022-06-29T02:34:12.067582     2 700.6117 700.5737 700.5927 700.5927

To get back to wide format you can use this:

... %>%
  pivot_wider(
    names_from = "Wafer",
    values_from = c("max", "min", "mean", "median"),
    names_glue = "Wafer({Wafer})_{.value}"
  )
# # A tibble: 6 × 9
#   TS                         `Wafer(1)_max` Wafer(2)_ma…¹ Wafer…² Wafer…³ Wafer…⁴ Wafer…⁵ Wafer…⁶ Wafer…⁷
#   <chr>                               <dbl>         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1 2022-06-29T02:33:34.827582           700.          701.    700.    701.    700.    701.    700.    701.
# 2 2022-06-29T02:33:42.277582           700.          701.    700.    700.    700.    700.    700.    700.
# 3 2022-06-29T02:33:49.727582           700.          701.    700.    700.    700.    701.    700.    701.
# 4 2022-06-29T02:33:57.169972           700.          701.    700.    700.    700.    700.    700.    700.
# 5 2022-06-29T02:34:04.617582           700.          701.    700.    700.    700.    700.    700.    700.
# 6 2022-06-29T02:34:12.067582           700.          701.    700.    701.    700.    701.    700.    701.
# # … with abbreviated variable names ¹​`Wafer(2)_max`, ²​`Wafer(1)_min`, ³​`Wafer(2)_min`, ⁴​`Wafer(1)_mean`,
# #   ⁵​`Wafer(2)_mean`, ⁶​`Wafer(1)_median`, ⁷​`Wafer(2)_median`
  •  Tags:  
  • r
  • Related