I would like to generate a column in a big data frame which is contains the information of the other columns. I am giving a very small reproducible example:
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"))
I would like to create a column called tax_rank where your taxonomy reaches species will have the value species
but if you reach higher rank like genus than the value will be genus
or family
like this output:
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"),
tax_rank = c("family" ,"species", "family", "species"))
But I would like to do it automatically with a big dataset, is it possible with dplyr
?
Thanks!
CodePudding user response:
In base R
, you can use max.col
on the non-NA values and choose ties.method = "last"
to keep the latest non-NA value.
names(tax)[max.col(!is.na(tax), ties.method = "last")]
This is possible to translate that into dplyr
:
library(dplyr)
tax %>%
mutate(tax_rank = names(tax)[max.col(!is.na(tax), ties.method = "last")])
# Family Genus Species tax_rank
# 1 Brassicacae <NA> <NA> Family
# 2 Pinaceae Pinus Pinus_sylvestris Species
# 3 Rosaceae <NA> <NA> Family
# 4 Liliaceae Lilia Calochortus nuttallii Species
Data (note that I convert "NA"
to NA
)
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c(NA ,"Pinus", NA, "Lilia"),
Species = c(NA ,"Pinus_sylvestris", NA, "Calochortus nuttallii"))
CodePudding user response:
First, your dataframe should be with NA
objects, no as characters:
tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"),
Genus = c(NA ,"Pinus",NA, "Lilia"),
Species = c(NA,"Pinus_sylvestris", NA, "Calochortus nuttallii"))
Then the column you want is the next one
tax %>% mutate(tax_rank = ifelse(!is.na(Species), "species", ifelse(!is.na(Genus), "genus", "family")))
This is the output
Family Genus Species tax_rank
1 Brassicacae <NA> <NA> family
2 Pinaceae Pinus Pinus_sylvestris species
3 Rosaceae <NA> <NA> family
4 Liliaceae Lilia Calochortus nuttallii species
CodePudding user response:
Using base R
tax$tax_rank <- apply(tax, 1, \(x) tail(names(x)[!is.na(x)], 1))
-output
> tax
Family Genus Species tax_rank
1 Brassicacae <NA> <NA> Family
2 Pinaceae Pinus Pinus_sylvestris Species
3 Rosaceae <NA> <NA> Family
4 Liliaceae Lilia Calochortus nuttallii Species