Home > Net >  How to create a column marking if values are present in other columns in r?
How to create a column marking if values are present in other columns in r?

Time:08-09

I have 5 columns of 5 different p-values. I am looking to mark in a new column when values in any of the 5 columns are present.

My data looks like this:

Gene   pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5
Gene1     0.01           0.2             NA           NA            NA
Gene2      NA            0.001           NA           NA            NA
Gene3     0.05           0.7             0.09         0.2           0.001  

I am looking to output a column that would look like this:

Gene   pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5 Group
Gene1     0.01           0.2             NA           NA            NA       Group1, Group2
Gene2      NA            0.001           NA           NA            NA       Group2
Gene3     0.05           0.7             0.09         0.2           0.001    Group1, Group2, Group3, Group4, Group5

I've tried to create code based off similar questions, trying to use dplyr and mutate() but I haven't gotten very far, any help on what other functions I could try would be appreciated.

Example input data:

df <- structure(list(Gene = c("Gene1", "Gene2", "Gene3"), pvalue_group1 = c(0.01, 
NA, 0.05), pvalue_group2 = c(0.2, 0.001, 0.7), pvalue_group3 = c(NA, 
NA, 0.09), pvalue_group4 = c(NA, NA, 0.2), pvalue_group5 = c(NA, 
NA, 0.001)), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

CodePudding user response:

You could try something like this:

library(tidyverse)

df %>%
  pivot_longer(-1) %>%
  filter(!is.na(value)) %>%
  select(-value) %>%
  group_by(Gene) %>%
  summarise(Group = list(name)) %>%
  select(Group) %>%
  bind_cols(df) %>%
  relocate(Group, .after = "pvalue_group5")

# # A tibble: 3 × 7
#   Gene  pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5 Group    
#   <chr>         <dbl>         <dbl>         <dbl>         <dbl>         <dbl> <list>   
# 1 Gene1          0.01         0.2           NA             NA          NA     <chr [2]>
# 2 Gene2         NA            0.001         NA             NA          NA     <chr [1]>
# 3 Gene3          0.05         0.7            0.09           0.2         0.001 <chr [5]>

Alternatively, you could replace the summarise step with summarise(Group = paste(name, collapse = ", ")) for a string-like output:

df %>%
  pivot_longer(-1) %>%
  filter(!is.na(value)) %>%
  select(-value) %>%
  group_by(Gene) %>%
  summarise(Group = paste(name, collapse = ", ")) %>%
  select(Group) %>%
  bind_cols(df) %>%
  relocate(Group, .after = "pvalue_group5")

# # A tibble: 3 × 7
#   Gene  pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5 Group                                                                  
#   <chr>         <dbl>         <dbl>         <dbl>         <dbl>         <dbl> <chr>                                                                  
# 1 Gene1          0.01         0.2           NA             NA          NA     pvalue_group1, pvalue_group2                                           
# 2 Gene2         NA            0.001         NA             NA          NA     pvalue_group2                                                          
# 3 Gene3          0.05         0.7            0.09           0.2         0.001 pvalue_group1, pvalue_group2, pvalue_group3, pvalue_group4, pvalue_gro…

CodePudding user response:

Here is another option: Main feature is to use cur_column(). We check if across the columns the value is not NA, If so then put in column name .. and so on:

library(tidyverse)

df %>% 
  mutate(across(-1, ~case_when(!is.na(.) ~ cur_column()), .names = 'new_{col}')) %>%
  unite(Group, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(Group = str_replace_all(Group, 'pvalue_', ''))
   Gene pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5                              Group
1 Gene1          0.01         0.200            NA            NA            NA                      group1 group2
2 Gene2            NA         0.001            NA            NA            NA                             group2
3 Gene3          0.05         0.700          0.09           0.2         0.001 group1 group2 group3 group4 group5

CodePudding user response:

library(data.table)

setDT(df)

x <- names(df)[-1]

df[, test := apply( .SD, 1, \(i) paste(x[which(!is.na(i))], collapse=',') ), .SDcols=x]

CodePudding user response:

Another possible solution:

library(tidyverse)
  
df %>% 
  mutate(pmap_dfr(.[,-1], ~ list(Group = names(df[,-1])[!is.na(c(...))] %>% 
    str_remove_all("pvalue_") %>% str_c(collapse = ", "))))

#>    Gene pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5
#> 1 Gene1          0.01         0.200            NA            NA            NA
#> 2 Gene2            NA         0.001            NA            NA            NA
#> 3 Gene3          0.05         0.700          0.09           0.2         0.001
#>                                    Group
#> 1                         group1, group2
#> 2                                 group2
#> 3 group1, group2, group3, group4, group5

Yet another possible solution:

library(tidyverse)

df %>%
  rowwise() %>%
  mutate(Group = str_c(names(df[-1])[!is.na(c_across(-Gene))], collapse=", ") %>% 
    str_remove_all("pvalue_")) %>%
  ungroup

Output:

  Gene  pvalue_group1 pvalue_group2 pvalue_group3 pvalue_group4 pvalue_group5 Group                                 
  <chr>         <dbl>         <dbl>         <dbl>         <dbl>         <dbl> <chr>                                 
1 Gene1          0.01         0.2           NA             NA          NA     group1, group2                        
2 Gene2         NA            0.001         NA             NA          NA     group2                                
3 Gene3          0.05         0.7            0.09           0.2         0.001 group1, group2, group3, group4, group5
  • Related