I have a csv data file that looks like this:
> head(df)
# A tibble: 6 x 6
  Name           C     H     N     O     S
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
1 'Alanine'      3     7     1     2     0
2 'Arginine'     6    14     4     2     0
3 'Cysteine'     3     7     1     2     1
4 'Sucrose'     12    22     0    11     0
5 'Fructose'     6    12     0     6     0
6 'Ribose'       5    10     0     5     0
I want to paste all these columns into one so that each row has a single molecular formula. I initially tried to do this by simply pasting the values from each column:
> for (i in seq_len(nrow(df))) {
    df$formula[i] <- paste0("C", df$C[i], "H", df$H[i], "N", df$N[i],
                            "O", df$O[i], "S", df$S[i])
  }
This works if there are no zeros in columns C through S, but whenever one or more of those columns contain a zero, the zero is pasted into the result as shown below. I would like to have molecular formulas without the zero-count elements.
> head(df$formula)
[1] "C3H7N1O2S0" "C6H14N4O2S0" "C3H7N1O2S1" "C12H22N0O11S0" "C6H12N0O6S0" "C5H10N0O5S0"
# what I want instead: "C3H7N1O2" "C6H14N4O2" "C3H7N1O2S1" "C12H22O11" "C6H12O6" "C5H10O5"
Is there a way to paste these columns only when the values are not zero? Or would it be easier to paste the columns like this and then remove the zero-count parts of the resulting string?
CodePudding user response:
I think removing the 0 parts is the easiest way:
library(dplyr)
df %>%
mutate(formula = gsub("[CHNOS]0", "", paste0("C", C, "H", H, "N", N, "O", O, "S", S)))
returns
# A tibble: 6 x 7
  Name           C     H     N     O     S formula
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 'Alanine'      3     7     1     2     0 C3H7N1O2
2 'Arginine'     6    14     4     2     0 C6H14N4O2
3 'Cysteine'     3     7     1     2     1 C3H7N1O2S1
4 'Sucrose'     12    22     0    11     0 C12H22O11
5 'Fructose'     6    12     0     6     0 C6H12O6
6 'Ribose'       5    10     0     5     0 C5H10O5
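As a minimal base R sketch of the same idea (no dplyr needed), the snippet below builds the raw strings and strips the zero-count parts with the same pattern. The two data rows are copied from the question; the names raw and made_up are only illustrative, and the made_up string is a hypothetical input used to show that counts such as 10 or 20 survive, because the pattern only matches an element letter immediately followed by a 0.

```r
# Two rows copied from the question, as a plain data frame
df <- data.frame(
  Name = c("Alanine", "Sucrose"),
  C = c(3, 12), H = c(7, 22), N = c(1, 0), O = c(2, 11), S = c(0, 0)
)

# Paste every element and count, zeros included
raw <- with(df, paste0("C", C, "H", H, "N", N, "O", O, "S", S))

# Drop each "<letter>0" pair; counts never start with 0 unless they are 0
df$formula <- gsub("[CHNOS]0", "", raw)
df$formula
# "C3H7N1O2" "C12H22O11"

# Hypothetical string (not from the question): 10 and 20 are left alone
made_up <- gsub("[CHNOS]0", "", "C10H20N0O0S0")
made_up
# "C10H20"
```

This relies on the counts being whole numbers printed without leading zeros, which holds for the data shown.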
A more general approach could be
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-Name) %>%
filter(value > 0) %>%
group_by(Name) %>%
summarise(formula = paste(name, value, collapse = "", sep = "")) %>%
right_join(df, by = "Name")
returning
# A tibble: 6 x 7
  Name       formula        C     H     N     O     S
  <chr>      <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>
1 'Alanine'  C3H7N1O2       3     7     1     2     0
2 'Arginine' C6H14N4O2      6    14     4     2     0
3 'Cysteine' C3H7N1O2S1     3     7     1     2     1
4 'Fructose' C6H12O6        6    12     0     6     0
5 'Ribose'   C5H10O5        5    10     0     5     0
6 'Sucrose'  C12H22O11     12    22     0    11     0
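Note that the rows above come back sorted by Name rather than in their original order. If that matters, the same filter-then-paste idea can be sketched in base R without reshaping. This is only one possible way to do it, using three rows copied from the question; the names elements, counts, labels and parts are illustrative.

```r
# Stand-in data: three rows copied from the question
df <- data.frame(
  Name = c("Alanine", "Cysteine", "Sucrose"),
  C = c(3, 3, 12), H = c(7, 7, 22), N = c(1, 1, 0),
  O = c(2, 2, 11), S = c(0, 1, 0)
)

elements <- c("C", "H", "N", "O", "S")
counts <- as.matrix(df[elements])

# Label every count with its element symbol, blanking out the zeros
labels <- colnames(counts)[col(counts)]
parts <- ifelse(counts > 0, paste0(labels, counts), "")

# Glue each row's pieces together; rows stay in their original order
df$formula <- unname(apply(parts, 1, paste0, collapse = ""))
df$formula
# "C3H7N1O2" "C3H7N1O2S1" "C12H22O11"
```

Because zero counts are never pasted in the first place, no regular expression is needed afterwards.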
CodePudding user response:
paste0 the names and values together in an interlaced way with rbind, and then remove from the output any non-numeric characters (\\D) immediately preceding a 0, as well as the 0 itself:
vars <- c("C","H","N","O","S")
gsub("\\D+0", "", do.call(paste0, rbind(names(dat[vars]), as.list(dat[vars]))))
##[1] "C3H7N1O2" "C6H14N4O2" "C3H7N1O2S1" "C12H22O11" "C6H12O6" "C5H10O5"
This works because of the alternating name, then list, name, then list... order that rbind creates, column by column:
rbind(names(dat[vars]), as.list(dat[vars]))
## C H N O S
##[1,] "C" "H" "N" "O" "S"
##[2,] integer,6 integer,6 integer,6 integer,6 integer,6
Where dat was:
dat <- read.table(text="
Name C H N O S
Alanine 3 7 1 2 0
Arginine 6 14 4 2 0
Cysteine 3 7 1 2 1
Sucrose 12 22 0 11 0
Fructose 6 12 0 6 0
Ribose 5 10 0 5 0
", header=TRUE)
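To make the interlacing step more concrete, here is a small sketch comparing the do.call version with the paste0 call written out by hand. It uses a cut-down, two-row stand-in for dat with only three element columns (values taken from the Alanine and Sucrose rows); the names interlaced and explicit are illustrative.

```r
# Cut-down stand-in for dat: two rows, three element columns
dat <- data.frame(C = c(3L, 12L), H = c(7L, 22L), N = c(1L, 0L))
vars <- c("C", "H", "N")

# rbind() puts each name above its column, so reading the result
# column by column gives: "C", dat$C, "H", dat$H, "N", dat$N
interlaced <- do.call(paste0, rbind(names(dat[vars]), as.list(dat[vars])))

# The same call with the arguments spelled out by hand
explicit <- paste0("C", dat$C, "H", dat$H, "N", dat$N)

identical(interlaced, explicit)
# TRUE
interlaced
# "C3H7N1" "C12H22N0"
```

The do.call form has the advantage that adding another element only means adding its name to vars.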