Home > Back-end >  Create a column in dataframe that contains information on the other columns
Create a column in dataframe that contains information on the other columns

Time:10-05

I would like to generate a column in a big data frame which is contains the information of the other columns. I am giving a very small reproducible example:

tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"), 
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"))

I would like to create a column called tax_rank where your taxonomy reaches species will have the value species but if you reach higher rank like genus than the value will be genus or familylike this output:

tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"), 
Genus = c("NA" ,"Pinus", "NA", "Lilia"),
Species = c("NA" ,"Pinus_sylvestris", "NA", "Calochortus nuttallii"),
tax_rank = c("family" ,"species", "family", "species"))

But I would like to do it automatically with a big dataset, is it possible with dplyr? Thanks!

CodePudding user response:

In base R, you can use max.col on the non-NA values and choose ties.method = "last" to keep the latest non-NA value.

names(tax)[max.col(!is.na(tax), ties.method = "last")]

This is possible to translate that into dplyr:

library(dplyr)
tax %>% 
  mutate(tax_rank = names(tax)[max.col(!is.na(tax), ties.method = "last")])

#        Family Genus               Species tax_rank
# 1 Brassicacae  <NA>                  <NA>   Family
# 2    Pinaceae Pinus      Pinus_sylvestris  Species
# 3    Rosaceae  <NA>                  <NA>   Family
# 4   Liliaceae Lilia Calochortus nuttallii  Species

Data (note that I convert "NA" to NA)

tax <- data.frame(
  Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"), 
  Genus = c(NA ,"Pinus", NA, "Lilia"),
  Species = c(NA ,"Pinus_sylvestris", NA, "Calochortus nuttallii"))

CodePudding user response:

First, your dataframe should be with NAobjects, no as characters:

tax <- data.frame(
Family = c("Brassicacae", "Pinaceae", "Rosaceae", "Liliaceae"), 
Genus = c(NA ,"Pinus",NA, "Lilia"),
Species = c(NA,"Pinus_sylvestris", NA, "Calochortus nuttallii"))

Then the column you want is the next one

tax %>% mutate(tax_rank = ifelse(!is.na(Species), "species", ifelse(!is.na(Genus), "genus", "family")))

This is the output

       Family Genus               Species tax_rank
1 Brassicacae  <NA>                  <NA>   family
2    Pinaceae Pinus      Pinus_sylvestris  species
3    Rosaceae  <NA>                  <NA>   family
4   Liliaceae Lilia Calochortus nuttallii  species

CodePudding user response:

Using base R

tax$tax_rank <- apply(tax, 1, \(x) tail(names(x)[!is.na(x)], 1))

-output

> tax
       Family Genus               Species tax_rank
1 Brassicacae  <NA>                  <NA>   Family
2    Pinaceae Pinus      Pinus_sylvestris  Species
3    Rosaceae  <NA>                  <NA>   Family
4   Liliaceae Lilia Calochortus nuttallii  Species
  • Related