In the dataframe (tibble) from the reprex below, there are two groups of rows I would like to group by. I could almost group by doc_num
, but there are rows with different values of doc_num
but belong in the same group. Specifically, the PO-100s go with the 15s and the PO-101s go with the 17s.
I understand that my attempt below fails because I have no way to reference the doc_num
in the current row in that cur_data
call since it would think I'm referencing the doc_num
column in cur_data
. But I'm not sure of how else to proceed.
It is guaranteed that only one value of pur_ord
will be populated per group of distinct doc_num
values, i.e. per grp
.
The expected output is the same example_df
but with a new column, grp
, which I can group by later and has the values of five 15s and then six 17s (i.e. c(rep(15, 5), rep(17, 6)
)
How can I achieve this?
library(tidyverse)
set.seed(123)
example_df <- tibble(
doc_num = c(rep(15, 4), "PO-100", rep(17, 4), rep("PO-101", 2)),
pur_ord = c("Purchase Order #PO-100", rep(NA_character_, 4),
"Purchase Order #PO-101", rep(NA_character_, 5)),
x = rnorm(11),
y = sample(LETTERS, 11)
)
example_df
#> # A tibble: 11 × 4
#> doc_num pur_ord x y
#> <chr> <chr> <dbl> <chr>
#> 1 15 Purchase Order #PO-100 -0.560 Y
#> 2 15 <NA> -0.230 I
#> 3 15 <NA> 1.56 C
#> 4 15 <NA> 0.0705 H
#> 5 PO-100 <NA> 0.129 G
#> 6 17 Purchase Order #PO-101 1.72 J
#> 7 17 <NA> 0.461 Z
#> 8 17 <NA> -1.27 S
#> 9 17 <NA> -0.687 D
#> 10 PO-101 <NA> -0.446 N
#> 11 PO-101 <NA> 1.22 A
example_df %>%
mutate(
grp = case_when(
!str_starts(doc_num, "PO") ~ doc_num,
TRUE ~ cur_data() %>%
filter(str_detect(pur_ord, doc_num)) %>%
pull(doc_num)
)
)
#> Error in `mutate()`:
#> ! Problem while computing `grp = case_when(...)`.
#> Caused by error in `case_when()`:
#> ! `TRUE ~ cur_data() %>% filter(str_detect(pur_ord, doc_num)) %>%
#> pull(doc_num)` must be length 11 or one, not 0.
Created on 2022-03-04 by the reprex package (v2.0.1)
CodePudding user response:
Here's a dplyr solution that should work even if the data aren't strictly ordered.
# 1. Group by each value of "doc_num" and find the unique "pur_ord".
# 2. Copy the PO number for each row of each group.
# 3. Ensure that this value is always present.
# 4. Use the PO number as a lookup into the original "doc_num" column.
new_df <- example_df %>%
group_by(doc_num) %>%
mutate(
po_col = max(pur_ord, na.rm = T),
po_col = gsub('.*(PO-\\d )', '\\1', po_col),
po_col = ifelse(!is.na(po_col), po_col, doc_num)
) %>%
group_by(po_col) %>%
mutate(
grp = unique(doc_num[!grepl('PO', doc_num)])
) %>%
ungroup %>%
select(-po_col)
doc_num pur_ord x y grp
<chr> <chr> <dbl> <chr> <chr>
1 15 Purchase Order #PO-100 -0.560 Y 15
2 15 NA -0.230 I 15
3 15 NA 1.56 C 15
4 15 NA 0.0705 H 15
5 PO-100 NA 0.129 G 15
6 17 Purchase Order #PO-101 1.72 J 17
7 17 NA 0.461 Z 17
8 17 NA -1.27 S 17
9 17 NA -0.687 D 17
10 PO-101 NA -0.446 N 17
11 PO-101 NA 1.22 A 17
CodePudding user response:
Assuming your data is structured and ordered (i.e. the PO-XXX will always come last per desired group), you can do:
library(tidyverse)
example_df %>%
mutate(grp = if_else(str_detect(doc_num, '^PO-'), NA_character_, doc_num)) %>%
fill(grp)
# A tibble: 11 x 5
doc_num pur_ord x y grp
<chr> <chr> <dbl> <chr> <chr>
1 15 Purchase Order #PO-100 -0.560 Y 15
2 15 <NA> -0.230 I 15
3 15 <NA> 1.56 C 15
4 15 <NA> 0.0705 H 15
5 PO-100 <NA> 0.129 G 15
6 17 Purchase Order #PO-101 1.72 J 17
7 17 <NA> 0.461 Z 17
8 17 <NA> -1.27 S 17
9 17 <NA> -0.687 D 17
10 PO-101 <NA> -0.446 N 17
11 PO-101 <NA> 1.22 A 17