Home > OS >  r group_by and matching for pairs of columns
r group_by and matching for pairs of columns

Time:07-16

I have a data frame df.

ID  001-alpha   001-sigma   001-delta   001-gamma   002-alpha   002-beta    002-charlie 002-delta
12  1           2           3           3           2               NA          2               0
21  NA          1           3           NA          3               2           2               NA
24  1           2           NA          3           3               1           3               2
36  NA          2           3           4           2               2           NA              1
44  NA          1           2           3           1               2           1               2
57  0           NA          2           3           1               1           3               1

In this df, I have several columns:

  1. ID field represents the product ID, which is unique

  2. Column names that begin with 001- or 002- represent some specifications for those products

Requirement

I am interested in checking for each pair of product specifications, how many IDs are common i.e., they contain non-null or non NA values.

To explain further, if I take the first two columns (001-alpha and 001-sigma) as an example. Here, ID = 12 and ID = 24 is common or non-null between these columns. For the remaining IDs, there are missing values in one of the columns. So my required output column 001-alpha-sigma will have a value 2.

Similarly, for the first and the third columns (001-alpha and 001-delta), again there are only two common IDs (i.e., ID = 12 and ID = 57) that contain non-null values across these two columns. So the output column 001-alpha-delta will contain 2.

My required output is as below:

001-alpha-sigma 001-alpha-delta 001-alpha-gamma 001-sigma-delta 001-sigma-gamma 001-delta-gamma 002-alpha-beta  002-alpha-charlie   002-alpha-delta 002-beta-charlie    002-beta-delta  002-charlie-delta
2               2               3               4               4               4               5               5                   5               4                   4               4

Is there are an easy approach to extract this information? e.g., can we use dplyr for this?

I have looked at a few other posts but they do not meet my requirement. For example,

Sum columns by group (row names) in a matrix

For convenience, I am also sharing dput(df) below.

structure(list
(ID = structure(1:6, .Label = c("12", "21", "24", "36", "44", "57"), 
class = "factor"), 
`001.alpha` = c(1L, NA, 1L, NA, NA, 0L), 
`001.sigma` = c(2L, 1L, 2L, 2L, 1L, NA), 
`001.delta` = c(3L, 3L, NA, 3L, 2L, 2L), 
`001.gamma` = c(3L, NA, 3L, 4L, 3L, 3L), 
`002.alpha` = c(2L, 3L, 3L, 2L, 1L, 1L), 
`002.beta` = c(NA,2L, 1L, 2L, 2L, 1L), 
`002.charlie` = c(2L, 2L, 3L, NA, 1L,3L), 
`002.delta` = c(0L, NA, 2L, 1L, 2L, 1L)), 
row.names = c(NA, -6L), class = "data.frame")

Thank you to your help!

CodePudding user response:

This question involves some simple tidy-select work based on all pairs of the two column names. I begin by make a ones and twos subset of the dataframe. Then in the main function, I start by creating a dataframe of all of the possible combinations of two names from the input. I mutate to create a column of the original data selected to have only the two columns of the specified row, then drop_na and count the rows of each subset data frame. The procedure is finished with some name manipulation and pivoting.

Let me know if this works:

library(tidyverse)
df <- structure(list(ID = structure(1:6, .Label = c("12", "21", "24", "36", "44", "57"), 
class = "factor"), 
`001.a.alpha` = c(1L, NA, 1L, NA, NA, 0L), `001.b.sigma` = c(2L, 1L, 2L, 2L, 1L, NA), `001.c.delta` = c(3L, 3L, NA, 3L, 2L, 2L), `001.e.gamma` = c(3L, NA, 3L, 4L, 3L, 3L), `002.ff.alpha` = c(2L, 3L, 3L, 2L, 1L, 1L), `002.ef.beta` = c(NA,2L, 1L, 2L, 2L, 1L), `002.gh.charlie` = c(2L, 2L, 3L, NA, 1L,3L), `002.vv.delta` = c(0L, NA, 2L, 1L, 2L, 1L)), row.names = c(NA, -6L), class = "data.frame") %>%
  as_tibble()

ones <- select(df, contains("001"))
twos <- select(df, contains("002"))

main <- function(tib, df) {
  t(combn(colnames(tib), 2)) %>%
    as_tibble(.name_repair = ~c("first_var", "second_var")) %>%
    rowwise() %>%
    mutate(contains_both = list(select(df, all_of(c(first_var, second_var)))),
           result = nrow(drop_na(contains_both)),
           combined_name = (str_c(str_extract(first_var, "^\\d "),
                                  str_extract(first_var, "[:alpha:] $"),
                                  str_extract(second_var, "[:alpha:] $"),
                                  sep = "-"))) %>%
    select(combined_name, result) %>%
    pivot_wider(names_from = combined_name, values_from = result)
}
bind_cols(
  main(ones, df),
  main(twos, df))
  • Related