I have a dataset that looks something like this:
dt<- data.table(group= c(1,1,2,2,3,3),
var1=c("a","b","c","c","c","a"),
var2=c(100,150,150,120,80,100))
group var1 var2
1: 1 a 100
2: 1 b 150
3: 2 c 150
4: 2 c 120
5: 3 c 80
6: 3 a 100
Where the group variables indicate that observations are of the same group.
I would like to perform operations within columns.
I would like to do two things:
First is to create an indicator variable that identifies observations where var1 is the same within each group. So for instance, in the example dataset, the IDvar1
variable would take the value of 1 for group 2 given that var1
is the same (c) for both observations in the group.
Second, create a variable that indicates the within-group difference of var2
. so for instance for group 1 diffvar2
would take the value of 50, given that 150-100=1.
The final dataset should look something like this.
group var1 var2 IDvar1 diffvar2
1: 1 a 100 0 50
2: 1 b 150 0 50
3: 2 c 150 1 -30
4: 2 c 120 1 -30
5: 3 c 80 0 20
6: 3 a 100 0 20
Any suggestions on how to do this in a clean and efficient way? thanks a lot!
CodePudding user response:
Grouped by 'group', compare whether the number of observations (.N
) is not equal to the number of unique observations of 'var1' (uniqueN
) to create the 'IDvar1', and get the diff
erence of 'var2' to create the 'diffvar2' column
library(data.table)
dt[, c("IDvar1", "diffvar2") := .( (uniqueN(var1) != .N),
diff(var2)[1]) , by = group]
-output
> dt
group var1 var2 IDvar1 diffvar2
<num> <char> <num> <int> <num>
1: 1 a 100 0 50
2: 1 b 150 0 50
3: 2 c 150 1 -30
4: 2 c 120 1 -30
5: 3 c 80 0 20
6: 3 a 100 0 20
CodePudding user response:
@akrun has a great answer using data.table
, I figured a dplyr
solution might also be welcome (slower, but a bit more readable).
library(dplyr)
df <- as_tibble(dt)
df <- df %>%
group_by(group) %>%
mutate(
IDvar1 = if_else(length(unique(var1)) == 1, 1, 0), # if there is only one unique value of var1 within each group, IDvar1 is 1, else it is 0
diffvar2 = diff(var2)[1]
)
In general, I would be careful with diff()
, especially when grouping because this will instantly break if there aren't exactly two observations per group.