I need to create a flagging rule. Here is a sample dataset.
df1 <- data.frame(problem = c(1,1,2,3,3,3,4),
item = c("a1","a2","a3","b1","b2","b11","c1"))
> df1
problem item
1 1 a1
2 1 a2
3 2 a3
4 3 b1
5 3 b2
6 3 b11
7 4 c1
I need to create a grouping variable based on problem
variable. Basically, when a problem has multiple items, there will be a number assigned. When the problem has only one item, I need to assign NA
. To make it clear, here is what my desired output.
> df2
problem item assign
1 Problem 1 a1 1
2 Problem 1 a2 1
3 Problem 2 a3 NA
4 Problem 3 b1 2
5 Problem 3 b2 2
6 Problem 3 b11 2
7 Problem 4 c1 NA
Since Problem 2
has only one item, we skip that, and give the order number 2
for Problem 3
.
Any thoughts on this? Thanks!
CodePudding user response:
Here is an option with data.table
- Grouped by 'problem', get the row index (.I
) where the number of rows (.N
) is greater than 1, use that column ($V1
) in the i
to subset the data, and create new column 'assign' with .GRP
grouped by 'problem'
library(data.table)
setDT(df1)[df1[, .I[.N > 1], problem]$V1, assign := .GRP, problem]
-output
> df1
problem item assign
<num> <char> <int>
1: 1 a1 1
2: 1 a2 1
3: 2 a3 NA
4: 3 b1 2
5: 3 b2 2
6: 3 b11 2
7: 4 c1 NA
CodePudding user response:
Here's a dplyr
option:
library(dplyr)
df1 %>%
add_count(problem, name = "n_prob") %>%
mutate(assign = as.integer(factor(ifelse(n_prob > 1, problem, NA)))) %>%
select(-n_prob)
# # A tibble: 7 × 3
# problem item assign
# <dbl> <chr> <int>
# 1 1 a1 1
# 2 1 a2 1
# 3 2 a3 NA
# 4 3 b1 2
# 5 3 b2 2
# 6 3 b11 2
# 7 4 c1 NA