Select rows to average into information of another row manually in R-CodePudding

I am performing some manual data cleaning and I want to expedite matters. I have a dataframe such as the following:

ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
A   1 2  2  2  2  B21       [Text]      [Text]        [Text]
B   2 3  5  8  3  B24       [Text]      [Text]        [Text]
C   3 4  6  1  1  A28       [Text]      [Text]        [Text]
D   4 1  5  7  8  A61       [Text]      [Text]        [Text]
E   5 2  8  3  2  A62       [Text]      [Text]        [Text]
F   6 3  9  5  9  C91       [Text]      [Text]        [Text]
G   7 4  1  4  6  D90       [Text]      [Text]        [Text]

Suppose I want to create new rows for IDs H and I. H's group would be B22. I want all of its variables to be averages of the variables for other units in groups starting with B2. ID I, should be group A64, and its variables should be an average of all units in groups starting starting with A6. The categorical variables are text that I do not want copied or averaged to the new rows IDs. The output would therefore look like the following:

ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
A   1 2  2  2  2  B21       [Text]      [Text]        [Text]
B   2 3  5  8  3  B24       [Text]      [Text]        [Text]
C   3 4  6  1  1  A28       [Text]      [Text]        [Text]
D   4 1  5  7  8  A61       [Text]      [Text]        [Text]
E   5 2  8  3  2  A62       [Text]      [Text]        [Text]
F   6 3  9  5  9  C91       [Text]      [Text]        [Text]
G   7 4  1  4  6  D90       [Text]      [Text]        [Text]
H   1.5 2.5 3.5 5 2.5 B22       
I   4.5 1.5 6.5 5 5 A64

I need to repeat this manually for many different observations with different IDs, so how can I set up the code to be readily adaptable as I repeat this process? Preferably it would be great to do this in dplyr but any suggestions would be welcome.

CodePudding user response：

You can create a little function, and call that function when you want a new row:

library(dplyr)

f <- function(id,grp,grp_pattern,data,cols="^V") {
  m = data %>% filter(grepl(grp_pattern,Group)) %>% summarize(across(matches(cols), mean))
  bind_rows(data, bind_cols(tibble(ID=id, Group=grp), m))
}

df = f("H", "B22", "^B2", df)
df = f("I", "A64", "^A6", df)
df = f("W", "IPG2", "^IPG",df)

Output:

   ID  V1  V2  V3  V4  V5 Group Categorical_1 Categorical_2 Categorical_3
1   A 1.0 2.0 2.0   2 2.0   B21        [Text]        [Text]        [Text]
2   B 2.0 3.0 5.0   8 3.0   B24        [Text]        [Text]        [Text]
3   C 3.0 4.0 6.0   1 1.0   A28        [Text]        [Text]        [Text]
4   D 4.0 1.0 5.0   7 8.0   A61        [Text]        [Text]        [Text]
5   E 5.0 2.0 8.0   3 2.0   A62        [Text]        [Text]        [Text]
6   F 6.0 3.0 9.0   5 9.0   C91        [Text]        [Text]        [Text]
7   G 7.0 4.0 1.0   4 6.0   D90        [Text]        [Text]        [Text]
8   H 1.5 2.5 3.5   5 2.5   B22          <NA>          <NA>          <NA>
9   I 4.5 1.5 6.5   5 5.0   A64          <NA>          <NA>          <NA>
10  W NaN NaN NaN NaN NaN  IPG2          <NA>          <NA>          <NA>

structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), V1 = 1:7, 
    V2 = c(2L, 3L, 4L, 1L, 2L, 3L, 4L), V3 = c(2L, 5L, 6L, 5L, 
    8L, 9L, 1L), V4 = c(2L, 8L, 1L, 7L, 3L, 5L, 4L), V5 = c(2L, 
    3L, 1L, 8L, 2L, 9L, 6L), Group = c("B21", "B24", "A28", "A61", 
    "A62", "C91", "D90"), Categorical_1 = c("[Text]", "[Text]", 
    "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"), Categorical_2 = c("[Text]", 
    "[Text]", "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"
    ), Categorical_3 = c("[Text]", "[Text]", "[Text]", "[Text]", 
    "[Text]", "[Text]", "[Text]")), class = "data.frame", row.names = c(NA, 
-7L))

CodePudding user response：

An option is to filter the rows where Group starts with 'A6' or 'B2', then grouped by the substring of 'Group', get the mean of numeric columns and bind with the original data

library(stringr)
library(dplyr)
df1  %>% 
  filter(str_detect(Group, "^(A6|B2|IPG2)")) %>% 
  group_by(Group = str_replace(Group, "^([A-Z] \\d)\\d ", "\\1")) %>%
  summarise(across(where(is.numeric), mean), .groups = 'drop') %>% 
  mutate(ID = c("H", "I"), .before = 1) %>%
  mutate(Group = str_c(Group, c("4", "2"))) %>% 
  bind_rows(df1, .)

-output

 ID  V1  V2  V3 V4  V5 Group Categorical_1 Categorical_2 Categorical_3
1  A 1.0 2.0 2.0  2 2.0   B21        [Text]        [Text]        [Text]
2  B 2.0 3.0 5.0  8 3.0   B24        [Text]        [Text]        [Text]
3  C 3.0 4.0 6.0  1 1.0   A28        [Text]        [Text]        [Text]
4  D 4.0 1.0 5.0  7 8.0   A61        [Text]        [Text]        [Text]
5  E 5.0 2.0 8.0  3 2.0   A62        [Text]        [Text]        [Text]
6  F 6.0 3.0 9.0  5 9.0   C91        [Text]        [Text]        [Text]
7  G 7.0 4.0 1.0  4 6.0   D90        [Text]        [Text]        [Text]
8  H 4.5 1.5 6.5  5 5.0   A64          <NA>          <NA>          <NA>
9  I 1.5 2.5 3.5  5 2.5   B22          <NA>          <NA>          <NA>

data

df1 <- structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), V1 = 1:7, 
    V2 = c(2L, 3L, 4L, 1L, 2L, 3L, 4L), V3 = c(2L, 5L, 6L, 5L, 
    8L, 9L, 1L), V4 = c(2L, 8L, 1L, 7L, 3L, 5L, 4L), V5 = c(2L, 
    3L, 1L, 8L, 2L, 9L, 6L), Group = c("B21", "B24", "A28", "A61", 
    "A62", "C91", "D90"), Categorical_1 = c("[Text]", "[Text]", 
    "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"), 
Categorical_2 = c("[Text]", 
    "[Text]", "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"
    ), Categorical_3 = c("[Text]", "[Text]", "[Text]", "[Text]", 
    "[Text]", "[Text]", "[Text]")), class = "data.frame", 
row.names = c(NA, 
-7L))