Home > OS >  Extracting sequences from columns using R
Extracting sequences from columns using R

Time:07-20

I have a df which looks like this

ID  X003-APP    X005-APP    X008-APP    X003-COP    X004-COP    X008-PIN    X009-PIN
363 NA          NA          1           0           NA          4          5
364 0           2           NA          1           5           1          5
678 0           NA          NA          5           NA          NA         NA
713 1           1           1           1           1           1          1
219 1           2           3           3           NA          4          5
234 NA          NA          NA          2           3           NA         NA
321 2           3           1           NA          NA          1          2

I am interested in minimum counts for non-null values across the column substrings APP, COP and PIN. My required output is:

ID     APP  COP PIN
363    1    1   1
364    1    1   1
678    1    1   0
713    1    1   1
219    1    1   1
234    0    1   0
321    1    0   1

For reference, I am sharing the dput():

structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L), 
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L), 
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L), 
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L), 
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA), 
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA), 
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L), 
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)), 
class = "data.frame", row.names = c(NA, -7L))

Edit:

Later on, I would like to analyse 2 and 3 sequences across IDs. For example, I am ultimately, interested in minimum counts for non-null values across the column substrings APP, COP and PIN. My ultimate required output for a sequence of length 2 would be:

Spec_1  Spec_2  Counts
APP     COP     5
APP     PIN     5
COP     PIN     4

Or correspondingly, my required output for a sequence of length 3 would be:

Spec_1  Spec_2  Spec_3 Counts
APP     COP     PIN    4

Is there an easy way to achieve this? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.

CodePudding user response:

Try this using dplyr

library(dplyr)

df |> rowwise() |> transmute( ID,
APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0 , TRUE ~ 1) ,
COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0 , TRUE ~ 1) , 
PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0 , TRUE ~ 1)) -> df1
  • output
# A tibble: 7 × 4
# Rowwise: 
     ID   APP   COP   PIN
  <int> <dbl> <dbl> <dbl>
1   363     1     1     1
2   364     1     1     1
3   678     1     1     0
4   713     1     1     1
5   219     1     1     1
6   234     0     1     0
7   321     1     0     1
  • for your second required you can use
df1 |> transmute(AC = case_when(sum(c_across(c(APP,COP))) == 2 ~ 1 , TRUE ~ 0) ,
AP = case_when(sum(c_across(c(APP,PIN))) == 2 ~ 1 , TRUE ~ 0) ,
CP = case_when(sum(c_across(c(PIN,COP))) == 2 ~ 1 , TRUE ~ 0) ,
ACP = case_when(sum(c_across(c(APP,COP,PIN))) == 3 ~ 1 , TRUE ~ 0)) |> ungroup() |> 
summarise(APP_COP = sum(AC) , APP_PIN = sum(AP) , COP_PIN = sum(CP) ,  APP_COP_PIN = sum(ACP))
  • output
# A tibble: 1 × 4
  APP_COP APP_PIN COP_PIN APP_COP_PIN
    <dbl>   <dbl>   <dbl>       <dbl>
1       5       5       4           4

CodePudding user response:

You may try

library(reshape2)
library(tidyverse)

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = as.numeric(any(!is.na(value)))) %>%
  filter(value == 1) %>%
  pivot_wider(names_from = "Spec", values_from = "value") %>%
  replace(is.na(.), 0)

     ID   APP   COP   PIN
  <int> <dbl> <dbl> <dbl>
1   219     1     1     1
2   234     0     1     0
3   321     1     0     1
4   363     1     1     1
5   364     1     1     1
6   678     1     1     0
7   713     1     1     1

Is your edited one and

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  summarise(Spec = combn(Spec, 2, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2) %>%
  summarize(Counts = n())

  Spec_1 Spec_2 Counts
  <chr>  <chr>   <int>
1 APP    COP         5
2 APP    PIN         5
3 COP    PIN         4

is your previous one.

3 seq?

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 2) %>%
  summarise(Spec = combn(Spec, 3, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2, Spec_3) %>%
  summarize(Counts = n())

  Spec_1 Spec_2 Spec_3 Counts
  <chr>  <chr>  <chr>   <int>
1 APP    COP    PIN         4
  • Related