Home > OS >  Within column operations in R
Within column operations in R

Time:05-23

I have a dataset that looks something like this:

dt<- data.table(group= c(1,1,2,2,3,3),
var1=c("a","b","c","c","c","a"),
              var2=c(100,150,150,120,80,100))
   group var1 var2
1:     1    a  100
2:     1    b  150
3:     2    c  150
4:     2    c  120
5:     3    c   80
6:     3    a  100

Where the group variables indicate that observations are of the same group.

I would like to perform operations within columns.

I would like to do two things: First is to create an indicator variable that identifies observations where var1 is the same within each group. So for instance, in the example dataset, the IDvar1 variable would take the value of 1 for group 2 given that var1 is the same (c) for both observations in the group.

Second, create a variable that indicates the within-group difference of var2. so for instance for group 1 diffvar2 would take the value of 50, given that 150-100=1.

The final dataset should look something like this.

   group var1 var2 IDvar1 diffvar2
1:     1    a  100      0       50
2:     1    b  150      0       50
3:     2    c  150      1      -30
4:     2    c  120      1      -30
5:     3    c   80      0       20
6:     3    a  100      0       20

Any suggestions on how to do this in a clean and efficient way? thanks a lot!

CodePudding user response:

Grouped by 'group', compare whether the number of observations (.N) is not equal to the number of unique observations of 'var1' (uniqueN) to create the 'IDvar1', and get the difference of 'var2' to create the 'diffvar2' column

library(data.table)
dt[, c("IDvar1", "diffvar2") := .( (uniqueN(var1) != .N), 
       diff(var2)[1]) , by = group]

-output

> dt
   group   var1  var2 IDvar1 diffvar2
   <num> <char> <num>  <int>    <num>
1:     1      a   100      0       50
2:     1      b   150      0       50
3:     2      c   150      1      -30
4:     2      c   120      1      -30
5:     3      c    80      0       20
6:     3      a   100      0       20

CodePudding user response:

@akrun has a great answer using data.table, I figured a dplyr solution might also be welcome (slower, but a bit more readable).

library(dplyr)
df <- as_tibble(dt)

df <- df %>%
    group_by(group) %>%
    mutate(
        IDvar1 = if_else(length(unique(var1)) == 1, 1, 0), # if there is only one unique value of var1 within each group, IDvar1 is 1, else it is 0
        diffvar2 = diff(var2)[1]
    )

In general, I would be careful with diff(), especially when grouping because this will instantly break if there aren't exactly two observations per group.

  • Related