I have a data frame df
.
ID 001-alpha 001-sigma 001-delta 001-gamma 002-alpha 002-beta 002-charlie 002-delta
12 1 2 3 3 2 NA 2 0
21 NA 1 3 NA 3 2 2 NA
24 1 2 NA 3 3 1 3 2
36 NA 2 3 4 2 2 NA 1
44 NA 1 2 3 1 2 1 2
57 0 NA 2 3 1 1 3 1
In this df
, I have several columns:
ID
field represents the product ID, which is uniqueColumn names that begin with
001-
or002-
represent some specifications for those products
Requirement
I am interested in checking for each pair of product specifications, how many IDs are common i.e., they contain non-null or non NA
values.
To explain further, if I take the first two columns (001-alpha
and 001-sigma
) as an example. Here, ID = 12 and ID = 24 is common or non-null between these columns. For the remaining IDs, there are missing values in one of the columns. So my required output column 001-alpha-sigma
will have a value 2
.
Similarly, for the first and the third columns (001-alpha
and 001-delta
), again there are only two common IDs (i.e., ID = 12 and ID = 57) that contain non-null values across these two columns. So the output column 001-alpha-delta
will contain 2
.
My required output is as below:
001-alpha-sigma 001-alpha-delta 001-alpha-gamma 001-sigma-delta 001-sigma-gamma 001-delta-gamma 002-alpha-beta 002-alpha-charlie 002-alpha-delta 002-beta-charlie 002-beta-delta 002-charlie-delta
2 2 3 4 4 4 5 5 5 4 4 4
Is there are an easy approach to extract this information? e.g., can we use dplyr for this?
I have looked at a few other posts but they do not meet my requirement. For example,
Sum columns by group (row names) in a matrix
For convenience, I am also sharing dput(df)
below.
structure(list
(ID = structure(1:6, .Label = c("12", "21", "24", "36", "44", "57"),
class = "factor"),
`001.alpha` = c(1L, NA, 1L, NA, NA, 0L),
`001.sigma` = c(2L, 1L, 2L, 2L, 1L, NA),
`001.delta` = c(3L, 3L, NA, 3L, 2L, 2L),
`001.gamma` = c(3L, NA, 3L, 4L, 3L, 3L),
`002.alpha` = c(2L, 3L, 3L, 2L, 1L, 1L),
`002.beta` = c(NA,2L, 1L, 2L, 2L, 1L),
`002.charlie` = c(2L, 2L, 3L, NA, 1L,3L),
`002.delta` = c(0L, NA, 2L, 1L, 2L, 1L)),
row.names = c(NA, -6L), class = "data.frame")
Thank you to your help!
CodePudding user response:
This question involves some simple tidy-select work based on all pairs of the two column names. I begin by make a ones and twos subset of the dataframe. Then in the main function, I start by creating a dataframe of all of the possible combinations of two names from the input. I mutate to create a column of the original data selected to have only the two columns of the specified row, then drop_na and count the rows of each subset data frame. The procedure is finished with some name manipulation and pivoting.
Let me know if this works:
library(tidyverse)
df <- structure(list(ID = structure(1:6, .Label = c("12", "21", "24", "36", "44", "57"),
class = "factor"),
`001.a.alpha` = c(1L, NA, 1L, NA, NA, 0L), `001.b.sigma` = c(2L, 1L, 2L, 2L, 1L, NA), `001.c.delta` = c(3L, 3L, NA, 3L, 2L, 2L), `001.e.gamma` = c(3L, NA, 3L, 4L, 3L, 3L), `002.ff.alpha` = c(2L, 3L, 3L, 2L, 1L, 1L), `002.ef.beta` = c(NA,2L, 1L, 2L, 2L, 1L), `002.gh.charlie` = c(2L, 2L, 3L, NA, 1L,3L), `002.vv.delta` = c(0L, NA, 2L, 1L, 2L, 1L)), row.names = c(NA, -6L), class = "data.frame") %>%
as_tibble()
ones <- select(df, contains("001"))
twos <- select(df, contains("002"))
main <- function(tib, df) {
t(combn(colnames(tib), 2)) %>%
as_tibble(.name_repair = ~c("first_var", "second_var")) %>%
rowwise() %>%
mutate(contains_both = list(select(df, all_of(c(first_var, second_var)))),
result = nrow(drop_na(contains_both)),
combined_name = (str_c(str_extract(first_var, "^\\d "),
str_extract(first_var, "[:alpha:] $"),
str_extract(second_var, "[:alpha:] $"),
sep = "-"))) %>%
select(combined_name, result) %>%
pivot_wider(names_from = combined_name, values_from = result)
}
bind_cols(
main(ones, df),
main(twos, df))