How to create a new column from existing two columns but omitting NAs rows in R-CodePudding

I have a data frame that a portion of it looks like this:

Domain <- c(rep("Bacteria",3),rep("Archaea", 2))
Phylum <- c("Proteobacteria","Cyanobacteria","Planctomycetota", "Thermoplasmatota", "Thermoplasmatota")
Class <- c("Alphaproteobacteria","Cyanobacteriia","Phycisphaerae","Poseidoniia_A",NA)
Order <- c("Sphingomonadales", NA, "Phycisphaerales", "Poseidoniales", NA)
Family <- c("Emcibacteraceae", NA, NA, "Poseidonia", NA)
Genus <- c("UBA4441", NA,NA,NA,NA)
Species <- c("UBA4441 sp", NA,NA,NA,NA)


demo_table <- data.frame(Domain, Phylum, Class, Order, Family, Genus, Species)

The point here is I want to create a new column called "assignation" that consist in the merge of the last two columns that contain non NA values row by row and that the values are separated by a space.

This is the expected output:

Domain	Phylum	Class	Order	Family	Genus	Species	assignation
Bacteria	Proteobacteria	Alphaproteobacteria	Sphingomonadales	Emcibacteraceae	UBA4441	UBA4441 sp	UBA4441 UBA4441 sp
Bacteria	Cyanobacteria	Cyanobacteriia	NA	NA	NA	NA	Cyanobacteria Cyanobacteriia
Bacteria	Planctomycetota	Phycisphaerae	Phycisphaerales	NA	NA	NA	Phycisphaerae Phycisphaerales
Archaea	Thermoplasmatota	Poseidoniia_A	Poseidoniales	Poseidonia	NA	NA	Poseidoniales Poseidonia
Archaea	Thermoplasmatota	NA	NA	NA	NA	NA	Archaea Thermoplasmatota

I Think that paste() may work on this case but not sure how to implement it so I can get the above mentioned expected output data frame.

CodePudding user response：

We may use base R - loop over the rows, remove the NA with na.omit, get the last two elements tail with n = 2 and paste

demo_table$assignation <- apply(demo_table, 1, 
   function(x) paste(tail(na.omit(x), 2), collapse = " "))

-output

demo_table$assignation
[1] "UBA4441 UBA4441 sp"            "Cyanobacteria Cyanobacteriia"  "Phycisphaerae Phycisphaerales" "Poseidoniales Poseidonia"     
[5] "Archaea Thermoplasmatota"

With tidyverse, we may also use unite and remove the NA with na.rm = TRUE, then extract the last two words

library(dplyr)
library(tidyr)
library(stringr)
demo_table %>% 
  unite(assignation, everything(), na.rm = TRUE, remove = FALSE) %>% 
  mutate(assignation = str_replace(assignation,     
     ".*_([^_] )_([^_] )$", "\\1 \\2")) %>% 
  relocate(assignation, .after =last_col())

CodePudding user response：

If you want to go for a tidyverse approach, you just need to use rowwise c_across. I think is it also nice to transform this operation in a function, in case you need to use later or even change the behavior of it.

Code

library(dplyr)

select_last <- function(x, n = 2){paste(tail(na.omit(x),n = n),collapse = " ")}

demo_table %>% 
  rowwise() %>% 
  mutate(assignation  = select_last(c_across()))

Output

# A tibble: 5 x 8
# Rowwise: 
  Domain  Phylum      Class        Order       Family     Genus Species  assignation        
  <chr>   <chr>       <chr>        <chr>       <chr>      <chr> <chr>    <chr>              
1 Bacter~ Proteobact~ Alphaproteo~ Sphingomon~ Emcibacte~ UBA4~ UBA4441~ UBA4441 UBA4441 sp 
2 Bacter~ Cyanobacte~ Cyanobacter~ NA          NA         NA    NA       Cyanobacteria Cyan~
3 Bacter~ Planctomyc~ Phycisphaer~ Phycisphae~ NA         NA    NA       Phycisphaerae Phyc~
4 Archaea Thermoplas~ Poseidoniia~ Poseidonia~ Poseidonia NA    NA       Poseidoniales Pose~
5 Archaea Thermoplas~ NA           NA          NA         NA    NA       Archaea Thermoplas~

CodePudding user response：

Here is dplyr combined with tidyr approach:

library(dplyr)
library(tidyr)
demo_table %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(-id) %>% 
  group_by(id) %>% 
  na.omit() %>% 
  arrange(-row_number(), .by_group = TRUE) %>% 
  mutate(assignation = paste(value[1], value[2], sep = "\n")) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(assignation) %>% 
  bind_cols(demo_table) %>% 
  View()