I would like to create a new column with an ID code based on multiple conditions of several columns. This is a sample of my data.
pat N C NC n1 c1
1 1 1 1 1 FALSE FALSE
2 2 1 1 1 FALSE FALSE
3 3 12 31 2 FALSE FALSE
4 4 12 31 2 FALSE FALSE
5 5 3 15 3 FALSE TRUE
6 6 4 15 4 FALSE TRUE
7 7 5 18 5 TRUE FALSE
8 8 5 20 6 TRUE FALSE
9 9 6 21 7 FALSE FALSE
10 10 7 21 8 FALSE FALSE
11 11 8 19 9 FALSE FALSE
12 12 9 11 10 FALSE FALSE
13 13 10 11 11 FALSE FALSE
14 14 11 14 12 FALSE FALSE
sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
N = c(1,1,12,12,3,4,5,5,6,7,8,9,10,11),
C = c(1,1,31,31,15,15,18,20,21,21,19,11,11,14),
NC = c(1,1,2,2,3,4,5,6,7,8,9,10,11,12),
n1 = c("FALSE", "FALSE","FALSE", "FALSE", "FALSE", "FALSE","TRUE","TRUE","FALSE","FALSE", "FALSE","FALSE", "FALSE", "FALSE"),
c1 = c("FALSE", "FALSE","FALSE", "FALSE", "TRUE", "TRUE","FALSE","FALSE","FALSE","FALSE", "FALSE","FALSE", "FALSE", "FALSE"))
Column N/C/NC are integer class. Column n1/c1 are logical. If a row has the same number for column N and C, then column NC assigned a duplicate number.
Now i would like to create a new column of ascending numbers, but mark rows with the same number if
- column NC is duplicate, or
- column N is duplicate and column n1 is FALSE, or
- column C is duplicate and column c1 is FALSE.
If column N or C is duplicate and n1 or c1 is TRUE, then it has to show a unique number.
I would like to have a desired output like this
pat N C NC n1 c1 new_ID
1 1 1 1 1 FALSE FALSE 1
2 2 1 1 1 FALSE FALSE 1
3 3 12 31 2 FALSE FALSE 2
4 4 12 31 2 FALSE FALSE 2
5 5 3 15 3 FALSE TRUE 3
6 6 4 15 4 FALSE TRUE 4
7 7 5 18 5 TRUE FALSE 5
8 8 5 20 6 TRUE FALSE 6
9 9 6 21 7 FALSE FALSE 7
10 10 7 21 8 FALSE FALSE 7
11 11 8 19 9 FALSE FALSE 8
12 12 9 11 10 FALSE FALSE 9
13 13 10 11 11 FALSE FALSE 9
14 14 11 14 12 FALSE FALSE 10
I know there have been several question about assigning new ID columns, but i couldt find one based on multiple (similar) conditions. Therefor i would like to ask this question here. Many thanks in advance.
CodePudding user response:
This code makes it possible to create an unique ID based on multiple variables.
sample <- data.table::as.data.table(sample)[
j = new_ID := base::as.numeric(base::interaction(var1, var..., varn,
drop=TRUE))
]
Since your explanation is not really clear to me, I let you try this by yourself. I believe you have to create a new variable/column for each condition, then put those variables into the code.
CodePudding user response:
Here is one option:
sample <- data.frame(pat = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
N = c(1,1,12,12,3,4,5,5,6,7,8,9,10,11),
C = c(1,1,31,31,15,15,18,20,21,21,19,11,11,14),
NC = c(1,1,2,2,3,4,5,6,7,8,9,10,11,12),
n1 = c(FALSE, FALSE,FALSE, FALSE, FALSE, FALSE,TRUE,TRUE,FALSE,FALSE, FALSE,FALSE, FALSE, FALSE),
c1 = c(FALSE, FALSE,FALSE, FALSE, TRUE, TRUE,FALSE,FALSE,FALSE,FALSE, FALSE,FALSE, FALSE, FALSE))
sample <- sample[order(sample$NC, sample$N, sample$C), ]
id <- 1
sample[1, 'new_ID'] <- id
for (i in 2:nrow(sample)) {
if (((sample[i, 'NC'] != sample[i - 1, 'NC']) | sample[i - 1, 'n1']) &
((sample[i, 'C'] != sample[i - 1, 'C']) | sample[i - 1, 'c1'])) {
id <- id 1
}
sample[i, 'new_ID'] <- id
}
> sample
pat N C NC n1 c1 new_ID
1 1 1 1 1 FALSE FALSE 1
2 2 1 1 1 FALSE FALSE 1
3 3 12 31 2 FALSE FALSE 2
4 4 12 31 2 FALSE FALSE 2
5 5 3 15 3 FALSE TRUE 3
6 6 4 15 4 FALSE TRUE 4
7 7 5 18 5 TRUE FALSE 5
8 8 5 20 6 TRUE FALSE 6
9 9 6 21 7 FALSE FALSE 7
10 10 7 21 8 FALSE FALSE 7
11 11 8 19 9 FALSE FALSE 8
12 12 9 11 10 FALSE FALSE 9
13 13 10 11 11 FALSE FALSE 9
14 14 11 14 12 FALSE FALSE 10
If it weren't for the n1
/c1
constraints, you should be able to use as.integer(factor(...)
where ...
= a paste
or interaction
call with the variables needed.
But with n1
and c1
, all I could think of was a loop. This requires sorting first! And note that you had quotes around TRUE
and FALSE
, which I removed.