Home > Software engineering >  dplyr: streamline creating matching and absolute difference variables
dplyr: streamline creating matching and absolute difference variables

Time:10-08

I have a dataset of friendships and characteristics of each individual, I'm trying to create variables that are if they match on the binary measures and what their absolute difference is for the continuous measures.

I can do this easily, but I was wondering if there is a different way to do it that is more streamlined than my method given that I have ~60 variables to do this with.

Sample Data:

dat <- read.table(text = "id.x  id.y    male.x  smoke.x drink.x everfight.x grades.x    male.y  smoke.y drink.y everfight.y grades.y
1   6   0   2   4   1   3   0   2   1   0   2
2   7   0   2   4   0   5   0   2   3   1   4
3   8   1   4   4   1   2   0   4   2   1   1
4   9   0   2   3   1   2   0   3   2   0   1
5   10  1   2   4   0   4   1   4   1   0   4", header = TRUE)

Here's what I've done:

dat <- dat %>%
       mutate(sex_match = case_when(male.x == male.y ~ 1,
                                    TRUE ~ 0),
              fight_match = case_when(everfight.x == everfight.y ~ 1,
                                      TRUE ~ 0),
              smoke_diff   = abs(smoke.x  - smoke.y),
              drink_diff   = abs(drink.x  - drink.y),
              grades_diff  = abs(grades.x - grades.y))

This gives me exactly what I want:

id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y sex_match fight_match smoke_diff drink_diff grades_diff
 1    6      0       2       4           1        3      0       2       1           0        2         1           0          0          3           1
 2    7      0       2       4           0        5      0       2       3           1        4         1           0          0          1           1
 3    8      1       4       4           1        2      0       4       2           1        1         0           1          0          2           1
 4    9      0       2       3           1        2      0       3       2           0        1         1           0          1          1           1
 5   10      1       2       4           0        4      1       4       1           0        4         1           1          2          3           0

However, I'm wondering if there's a way to do this with a loop or apply that identifies corresponding vairables and create the matching and absolute difference new variables in the sample output above.

UPDATE

Ended up using most of what Jon answered with and one part of akrun, here's what worked best for me:

non_binary <- dat %>% select(., contains(".x")) %>%
                      select(., -id.x) %>%
                      select_if(~!all(. %in% 0:1)) %>% 
                      rename_with(~str_remove(., '.x')) %>%
                      names()
dat %>%
  pivot_longer(-c(id.x:id.y), 
               names_to = c("var", ".value"),
               names_pattern = "(. ).(. )") %>%
  mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
  mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
  select(-c(var:y)) %>%
  pivot_wider(names_from = col_name, values_from = match)

Thanks to both of you!

CodePudding user response:

Here's a tidyr/dplyr approach. First I reshape to a long format with a row for each id/variable combination, with columns for each version. Then I can compare those for every pair at once, and reshape wide.

library(dplyr); library(tidyr)
non_binary <- c("smoke", "drink", "grades")
dat %>%
  pivot_longer(-c(id.x:id.y), 
               names_to = c("var", ".value"),
               names_pattern = "(. ).(. )") %>%
  mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
  mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
  select(-c(var:y)) %>%
  pivot_wider(names_from = col_name, values_from = match)

Result, which could be appended to original data:

# A tibble: 5 x 7
   id.x  id.y male_match smoke_diff drink_diff everfight_match grades_diff
  <int> <int>      <int>      <int>      <int>           <int>       <int>
1     1     6          1          0          3               0           1
2     2     7          1          0          1               0           1
3     3     8          0          0          2               1           1
4     4     9          1          1          1               0           1
5     5    10          1          2          3               1           0

CodePudding user response:

We can use tidyverse with across which can do this with dplyr/stringr packages alone i.e. loop across the .x columns of 'male', 'everfight', then get the value of the corresponding .y columns to create the binary column, similarly do this on the other columns, and get the absolute difference. In the .names, replace the column name by making use of str_replace

library(dplyr)
library(stringr)
dat %>% 
   mutate(across(c(male.x, everfight.x ),
      ~  (. == get(str_replace(cur_column(), 'x$', 'y'))),
       .names = "{str_replace(.col, '.x', '_match')}"), 
     across(c(smoke.x, drink.x, grades.x), 
         ~
       abs(. - get(str_replace(cur_column(), 'x$', 'y'))),
           .names = "{str_replace(.col, '.x', '_diff')}"))

-output

id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y male_match everfight_match smoke_diff drink_diff grades_diff
1    1    6      0       2       4           1        3      0       2       1           0        2          1               0          0          3           1
2    2    7      0       2       4           0        5      0       2       3           1        4          1               0          0          1           1
3    3    8      1       4       4           1        2      0       4       2           1        1          0               1          0          2           1
4    4    9      0       2       3           1        2      0       3       2           0        1          1               0          1          1           1
5    5   10      1       2       4           0        4      1       4       1           0        4          1               1          2          3           0

Or may do this in a single across as well

dat %>% 
    mutate(across(ends_with('.x'), ~ {
       other <- get(str_replace(cur_column(), 'x$', 'y'))
    if(all(. %in% c(0, 1)) )   (. == other) else abs(. - other)
       }, .names = "{str_replace(.col, '.x', '_diff')}"))
  • Related