I have a df
which looks like this
ID X003-APP X005-APP X008-APP X003-COP X004-COP X008-PIN X009-PIN
363 NA NA 1 0 NA 4 5
364 0 2 NA 1 5 1 5
678 0 NA NA 5 NA NA NA
713 1 1 1 1 1 1 1
219 1 2 3 3 NA 4 5
234 NA NA NA 2 3 NA NA
321 2 3 1 NA NA 1 2
I am interested in minimum counts for non-null values across the column substrings APP
, COP
and PIN
. My required output is:
ID APP COP PIN
363 1 1 1
364 1 1 1
678 1 1 0
713 1 1 1
219 1 1 1
234 0 1 0
321 1 0 1
For reference, I am sharing the dput():
structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
class = "data.frame", row.names = c(NA, -7L))
Edit:
Later on, I would like to analyse 2 and 3 sequences across IDs. For example, I am ultimately, interested in minimum counts for non-null values across the column substrings APP
, COP
and PIN
. My ultimate required output for a sequence of length 2 would be:
Spec_1 Spec_2 Counts
APP COP 5
APP PIN 5
COP PIN 4
Or correspondingly, my required output for a sequence of length 3 would be:
Spec_1 Spec_2 Spec_3 Counts
APP COP PIN 4
Is there an easy way to achieve this? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.
CodePudding user response:
Try this using dplyr
library(dplyr)
df |> rowwise() |> transmute( ID,
APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0 , TRUE ~ 1) ,
COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0 , TRUE ~ 1) ,
PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0 , TRUE ~ 1)) -> df1
- output
# A tibble: 7 × 4
# Rowwise:
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 363 1 1 1
2 364 1 1 1
3 678 1 1 0
4 713 1 1 1
5 219 1 1 1
6 234 0 1 0
7 321 1 0 1
- for your second required you can use
df1 |> transmute(AC = case_when(sum(c_across(c(APP,COP))) == 2 ~ 1 , TRUE ~ 0) ,
AP = case_when(sum(c_across(c(APP,PIN))) == 2 ~ 1 , TRUE ~ 0) ,
CP = case_when(sum(c_across(c(PIN,COP))) == 2 ~ 1 , TRUE ~ 0) ,
ACP = case_when(sum(c_across(c(APP,COP,PIN))) == 3 ~ 1 , TRUE ~ 0)) |> ungroup() |>
summarise(APP_COP = sum(AC) , APP_PIN = sum(AP) , COP_PIN = sum(CP) , APP_COP_PIN = sum(ACP))
- output
# A tibble: 1 × 4
APP_COP APP_PIN COP_PIN APP_COP_PIN
<dbl> <dbl> <dbl> <dbl>
1 5 5 4 4
CodePudding user response:
You may try
library(reshape2)
library(tidyverse)
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = as.numeric(any(!is.na(value)))) %>%
filter(value == 1) %>%
pivot_wider(names_from = "Spec", values_from = "value") %>%
replace(is.na(.), 0)
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 219 1 1 1
2 234 0 1 0
3 321 1 0 1
4 363 1 1 1
5 364 1 1 1
6 678 1 1 0
7 713 1 1 1
Is your edited one and
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = any(!is.na(value))) %>%
filter(value) %>%
group_by(ID) %>%
filter(n() > 1) %>%
summarise(Spec = combn(Spec, 2, simplify = F)) %>%
unnest_wider(Spec, names_sep = "_") %>%
group_by(Spec_1, Spec_2) %>%
summarize(Counts = n())
Spec_1 Spec_2 Counts
<chr> <chr> <int>
1 APP COP 5
2 APP PIN 5
3 COP PIN 4
is your previous one.
3 seq?
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = any(!is.na(value))) %>%
filter(value) %>%
group_by(ID) %>%
filter(n() > 2) %>%
summarise(Spec = combn(Spec, 3, simplify = F)) %>%
unnest_wider(Spec, names_sep = "_") %>%
group_by(Spec_1, Spec_2, Spec_3) %>%
summarize(Counts = n())
Spec_1 Spec_2 Spec_3 Counts
<chr> <chr> <chr> <int>
1 APP COP PIN 4