I am performing some manual data cleaning and I want to expedite matters. I have a dataframe such as the following:
ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
A 1 2 2 2 2 B21 [Text] [Text] [Text]
B 2 3 5 8 3 B24 [Text] [Text] [Text]
C 3 4 6 1 1 A28 [Text] [Text] [Text]
D 4 1 5 7 8 A61 [Text] [Text] [Text]
E 5 2 8 3 2 A62 [Text] [Text] [Text]
F 6 3 9 5 9 C91 [Text] [Text] [Text]
G 7 4 1 4 6 D90 [Text] [Text] [Text]
Suppose I want to create new rows for IDs H and I. H's group would be B22. I want all of its variables to be averages of the variables for other units in groups starting with B2. ID I, should be group A64, and its variables should be an average of all units in groups starting starting with A6. The categorical variables are text that I do not want copied or averaged to the new rows IDs. The output would therefore look like the following:
ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
A 1 2 2 2 2 B21 [Text] [Text] [Text]
B 2 3 5 8 3 B24 [Text] [Text] [Text]
C 3 4 6 1 1 A28 [Text] [Text] [Text]
D 4 1 5 7 8 A61 [Text] [Text] [Text]
E 5 2 8 3 2 A62 [Text] [Text] [Text]
F 6 3 9 5 9 C91 [Text] [Text] [Text]
G 7 4 1 4 6 D90 [Text] [Text] [Text]
H 1.5 2.5 3.5 5 2.5 B22
I 4.5 1.5 6.5 5 5 A64
I need to repeat this manually for many different observations with different IDs, so how can I set up the code to be readily adaptable as I repeat this process? Preferably it would be great to do this in dplyr
but any suggestions would be welcome.
CodePudding user response:
You can create a little function, and call that function when you want a new row:
library(dplyr)
f <- function(id,grp,grp_pattern,data,cols="^V") {
m = data %>% filter(grepl(grp_pattern,Group)) %>% summarize(across(matches(cols), mean))
bind_rows(data, bind_cols(tibble(ID=id, Group=grp), m))
}
df = f("H", "B22", "^B2", df)
df = f("I", "A64", "^A6", df)
df = f("W", "IPG2", "^IPG",df)
Output:
ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
1 A 1.0 2.0 2.0 2 2.0 B21 [Text] [Text] [Text]
2 B 2.0 3.0 5.0 8 3.0 B24 [Text] [Text] [Text]
3 C 3.0 4.0 6.0 1 1.0 A28 [Text] [Text] [Text]
4 D 4.0 1.0 5.0 7 8.0 A61 [Text] [Text] [Text]
5 E 5.0 2.0 8.0 3 2.0 A62 [Text] [Text] [Text]
6 F 6.0 3.0 9.0 5 9.0 C91 [Text] [Text] [Text]
7 G 7.0 4.0 1.0 4 6.0 D90 [Text] [Text] [Text]
8 H 1.5 2.5 3.5 5 2.5 B22 <NA> <NA> <NA>
9 I 4.5 1.5 6.5 5 5.0 A64 <NA> <NA> <NA>
10 W NaN NaN NaN NaN NaN IPG2 <NA> <NA> <NA>
structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), V1 = 1:7,
V2 = c(2L, 3L, 4L, 1L, 2L, 3L, 4L), V3 = c(2L, 5L, 6L, 5L,
8L, 9L, 1L), V4 = c(2L, 8L, 1L, 7L, 3L, 5L, 4L), V5 = c(2L,
3L, 1L, 8L, 2L, 9L, 6L), Group = c("B21", "B24", "A28", "A61",
"A62", "C91", "D90"), Categorical_1 = c("[Text]", "[Text]",
"[Text]", "[Text]", "[Text]", "[Text]", "[Text]"), Categorical_2 = c("[Text]",
"[Text]", "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"
), Categorical_3 = c("[Text]", "[Text]", "[Text]", "[Text]",
"[Text]", "[Text]", "[Text]")), class = "data.frame", row.names = c(NA,
-7L))
CodePudding user response:
An option is to filter
the rows where Group starts with 'A6' or 'B2', then grouped by the substring
of 'Group', get the mean
of numeric
columns and bind with the original data
library(stringr)
library(dplyr)
df1 %>%
filter(str_detect(Group, "^(A6|B2|IPG2)")) %>%
group_by(Group = str_replace(Group, "^([A-Z] \\d)\\d ", "\\1")) %>%
summarise(across(where(is.numeric), mean), .groups = 'drop') %>%
mutate(ID = c("H", "I"), .before = 1) %>%
mutate(Group = str_c(Group, c("4", "2"))) %>%
bind_rows(df1, .)
-output
ID V1 V2 V3 V4 V5 Group Categorical_1 Categorical_2 Categorical_3
1 A 1.0 2.0 2.0 2 2.0 B21 [Text] [Text] [Text]
2 B 2.0 3.0 5.0 8 3.0 B24 [Text] [Text] [Text]
3 C 3.0 4.0 6.0 1 1.0 A28 [Text] [Text] [Text]
4 D 4.0 1.0 5.0 7 8.0 A61 [Text] [Text] [Text]
5 E 5.0 2.0 8.0 3 2.0 A62 [Text] [Text] [Text]
6 F 6.0 3.0 9.0 5 9.0 C91 [Text] [Text] [Text]
7 G 7.0 4.0 1.0 4 6.0 D90 [Text] [Text] [Text]
8 H 4.5 1.5 6.5 5 5.0 A64 <NA> <NA> <NA>
9 I 1.5 2.5 3.5 5 2.5 B22 <NA> <NA> <NA>
data
df1 <- structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), V1 = 1:7,
V2 = c(2L, 3L, 4L, 1L, 2L, 3L, 4L), V3 = c(2L, 5L, 6L, 5L,
8L, 9L, 1L), V4 = c(2L, 8L, 1L, 7L, 3L, 5L, 4L), V5 = c(2L,
3L, 1L, 8L, 2L, 9L, 6L), Group = c("B21", "B24", "A28", "A61",
"A62", "C91", "D90"), Categorical_1 = c("[Text]", "[Text]",
"[Text]", "[Text]", "[Text]", "[Text]", "[Text]"),
Categorical_2 = c("[Text]",
"[Text]", "[Text]", "[Text]", "[Text]", "[Text]", "[Text]"
), Categorical_3 = c("[Text]", "[Text]", "[Text]", "[Text]",
"[Text]", "[Text]", "[Text]")), class = "data.frame",
row.names = c(NA,
-7L))