R table with number of atoms, concatenate/paste to molecular formula

I have a CSV data file that looks like this:

> head(df)
# A tibble: 6 x 6
  Name                        C     H     N     O     S
  <chr>                     <dbl> <dbl> <dbl> <dbl> <dbl>
1 'Alanine'                   3     7     1     2     0         
2 'Arginine'                  6     14    4     2     0     
3 'Cysteine'                  3     7     1     2     1
4 'Sucrose'                   12    22    0     11    0
5 'Fructose'                  6     12    0     6     0  
6 'Ribose'                    5     10    0     5     0

I want to paste all these columns into one so that each row has a single molecular formula. I first tried simply pasting the values from each column:

> for (i in seq_len(nrow(df))) {
df$formula[i] <- paste0("C", df$C[i], "H", df$H[i], "N", df$N[i],
                        "O", df$O[i], "S", df$S[i]) }

This works if there are no zeros in columns C through S, but if any of those columns contains a zero, the zero is pasted as well, as shown below. I would like molecular formulas that leave out the elements with a zero count.

> head(df$formula)
[1] "C3H7N1O2S0"  "C6H14N4O2S0"  "C3H7N1O2S1"  "C12H22N0O11S0"  "C6H12N0O6S0"  "C5H10N0O5S0"
# what I want instead: "C3H7N1O2"  "C6H14N4O2"  "C3H7N1O2S1"  "C12H22O11"  "C6H12O6"  "C5H10O5"

Is there a way to paste these columns only where the values are not zero? Or would it be easier to paste the columns as above and then remove the zero-count parts of the resulting string?

CodePudding user response：

I think removing the 0 parts is the easiest way:

library(dplyr)

df %>% 
  mutate(formula = gsub("[CHNOS]0", "", paste0("C", C, "H", H, "N", N, "O", O, "S", S)))

returns

# A tibble: 6 x 7
  Name           C     H     N     O     S formula   
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     
1 'Alanine'      3     7     1     2     0 C3H7N1O2  
2 'Arginine'     6    14     4     2     0 C6H14N4O2 
3 'Cysteine'     3     7     1     2     1 C3H7N1O2S1
4 'Sucrose'     12    22     0    11     0 C12H22O11 
5 'Fructose'     6    12     0     6     0 C6H12O6   
6 'Ribose'       5    10     0     5     0 C5H10O5
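
One thing worth checking with this pattern is that it only strips true zero counts. Because `[CHNOS]0` needs the zero to come directly after the element letter, counts such as 10 or 20 are left alone. A small sketch, using made-up formula strings (not from the data above) and assuming counts are never written with leading zeros:

```r
# Hypothetical formula strings, chosen to include counts that end in zero
x <- c("C10H20N0O5S0", "C3H7N1O2S0", "C6H12N0O6S0")

# A letter immediately followed by 0 is a zero count and is removed;
# in "C10" the letter is followed by 1, so it does not match
cleaned <- gsub("[CHNOS]0", "", x)
print(cleaned)
# [1] "C10H20O5" "C3H7N1O2" "C6H12O6"
```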

A more general approach, which does not hard-code the element names, could be

library(tidyr)
library(dplyr)

df %>% 
  pivot_longer(-Name) %>% 
  filter(value > 0) %>% 
  group_by(Name) %>% 
  summarise(formula = paste0(name, value, collapse = "")) %>% 
  right_join(df, by = "Name")

returning

# A tibble: 6 x 7
  Name       formula        C     H     N     O     S
  <chr>      <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
1 'Alanine'  C3H7N1O2       3     7     1     2     0
2 'Arginine' C6H14N4O2      6    14     4     2     0
3 'Cysteine' C3H7N1O2S1     3     7     1     2     1
4 'Fructose' C6H12O6        6    12     0     6     0
5 'Ribose'   C5H10O5        5    10     0     5     0
6 'Sucrose'  C12H22O11     12    22     0    11     0
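
If keeping the original row order matters, or dplyr and tidyr are not available, the same idea (keep only the non-zero counts, then paste) can be sketched in base R. This is a minimal sketch, assuming the element columns are exactly C, H, N, O and S; the data frame just re-creates the example data, and `elements` and `make_formula` are names introduced here for illustration.

```r
# Re-create the example data (values copied from the question)
df <- data.frame(
  Name = c("Alanine", "Arginine", "Cysteine", "Sucrose", "Fructose", "Ribose"),
  C = c(3, 6, 3, 12, 6, 5),
  H = c(7, 14, 7, 22, 12, 10),
  N = c(1, 4, 1, 0, 0, 0),
  O = c(2, 2, 2, 11, 6, 5),
  S = c(0, 0, 1, 0, 0, 0)
)

# Assumed set of element columns, in the order they should appear
elements <- c("C", "H", "N", "O", "S")

# Hypothetical helper: build one formula from one row of counts
make_formula <- function(counts) {
  keep <- counts > 0                       # drop elements with a zero count
  paste0(elements[keep], counts[keep], collapse = "")
}

# Apply the helper row by row; row order of df is preserved
df$formula <- unname(apply(df[elements], 1, make_formula))
print(df$formula)
```

Unlike the join-based version, this never reorders the rows, and a row whose counts are all zero gets an empty string rather than NA.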

CodePudding user response：

Use rbind to interlace the column names with the column values, paste0 the result together, and then remove every run of non-digit characters (\\D+) that is immediately followed by a 0, along with the 0 itself:

vars <- c("C","H","N","O","S")
gsub("\\D+0", "", do.call(paste0, rbind(names(dat[vars]), as.list(dat[vars]))))
##[1] "C3H7N1O2"   "C6H14N4O2"  "C3H7N1O2S1" "C12H22O11"  "C6H12O6"    "C5H10O5"

This works because rbind builds a two-row matrix that, read in column order, alternates name, values, name, values, and so on; do.call then passes those elements to paste0 as separate arguments:

rbind(names(dat[vars]), as.list(dat[vars]))
##     C         H         N         O         S        
##[1,] "C"       "H"       "N"       "O"       "S"      
##[2,] integer,6 integer,6 integer,6 integer,6 integer,6
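
To see the interlacing on a smaller scale, the call that do.call builds can be compared with writing the arguments out by hand. A small sketch with just two columns and two rows (the C and N counts for Alanine and Sucrose; `small`, `args`, `by_hand` and `via_rbind` are names made up for this illustration):

```r
# Two columns from the example data (Alanine and Sucrose rows only)
small <- data.frame(C = c(3, 12), N = c(1, 0))

# rbind puts each name above its column; read column by column this is
# "C", <C values>, "N", <N values>
args <- rbind(names(small), as.list(small))

# Spelling the same arguments out by hand
by_hand <- paste0("C", small$C, "N", small$N)

# Letting do.call pass the interlaced list to paste0
via_rbind <- do.call(paste0, args)

print(via_rbind)
# [1] "C3N1"  "C12N0"
print(identical(by_hand, via_rbind))
# [1] TRUE
```

The zero counts are still present at this stage; removing them is the job of the gsub step.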

Where dat was:

dat <- read.table(text="
Name        C     H     N     O     S
Alanine     3     7     1     2     0         
Arginine    6    14     4     2     0     
Cysteine    3     7     1     2     1
Sucrose    12    22     0    11     0
Fructose    6    12     0     6     0  
Ribose      5    10     0     5     0  
", header=TRUE)