How to append text to a column based on conditions?

Time:11-09

I have an empty column designated for categorising entries in my data frame. Categories are not exclusive, i.e. one entry can have multiple categories.

         animals categories
1         monkey      
2 humpback whale      
3    river trout      
4        seagull      

The categories column should be filled in based on each animal's properties. The properties are stored in vectors of keywords, and a keyword does not have to match an animal name exactly (for example, "whale" should match "humpback whale").

mammals <- c("whale", "monkey", "dog")
swimming <- c("whale", "trout", "dolphin")

How do I get the following result, ideally without looping?

         animals      categories
1         monkey          mammal     
2 humpback whale mammal,swimming     
3    river trout        swimming     
4        seagull      

CodePudding user response:

This can be done with fuzzyjoin after creating a key/value dataset. lst from dplyr returns a named list, which enframe converts to a two-column dataset. Then unnest the list column, group by 'animals', paste the 'categories' into a single string with toString, and join the result to the original dataset with regex_left_join.

library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
keydat <- lst(mammals, swimming) %>%
     enframe(name = 'categories', value = 'animals') %>% 
     unnest(animals) %>%
     group_by(animals) %>% 
     summarise(categories = toString(categories))
regex_left_join(df1, keydat, by= 'animals', ignore_case = TRUE) %>% 
     transmute(animals = animals.x, categories)
# A tibble: 4 × 2
  animals        categories       
  <chr>          <chr>            
1 monkey         mammals          
2 humpback whale mammals, swimming
3 river trout    swimming         
4 seagull        <NA>       

data

df1 <- tibble(animals = c('monkey', 'humpback whale', 'river trout', 'seagull'))
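
The join leaves NA for animals without a match, whereas the question shows an empty cell and no space after the comma. A minimal base R sketch of that clean-up follows; res is a hypothetical hand-typed copy of the result printed above, not an object created by the code in this answer.

```r
# Hypothetical copy of the join result printed above (typed by hand)
res <- data.frame(
  animals = c("monkey", "humpback whale", "river trout", "seagull"),
  categories = c("mammals", "mammals, swimming", "swimming", NA),
  stringsAsFactors = FALSE
)

# Replace missing categories with an empty string
res$categories[is.na(res$categories)] <- ""

# Optional: drop the space after each comma, giving "mammals,swimming"
res$categories <- gsub(", ", ",", res$categories, fixed = TRUE)
```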

CodePudding user response:

A base R option using stack, aggregate and grepl

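# Lookup table: one row per keyword, with its categories joined into one string
lut <- aggregate(
  . ~ values,
  type.convert(
    stack(list(mammals = mammals, swimming = swimming)),
    as.is = TRUE
  ),
  toString
)
# Logical matrix: rows are animals, columns are keywords
p <- sapply(
  lut$values,
  grepl,
  x = df$animals
)
# Column index of the matching keyword per row (NA when nothing matches)
df$categories <- lut$ind[replace(rowSums(p * col(p)), rowSums(p) == 0, NA)]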

which gives

> df
         animals        categories
1         monkey           mammals
2 humpback whale mammals, swimming
3    river trout          swimming
4        seagull              <NA>

Data

df <- data.frame(animals = c("monkey", "humpback whale", "river trout", "seagull"))
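
Note that rowSums(p * col(p)) only returns a valid index when each animal matches at most one keyword. If an animal could match several keywords, one possible variant is to match per category instead of per keyword. This is a sketch under that assumption, not part of the answer above; the names cats, patterns and hits are made up for illustration.

```r
# Inputs copied from the question and the Data line above
df <- data.frame(animals = c("monkey", "humpback whale", "river trout", "seagull"))
mammals <- c("whale", "monkey", "dog")
swimming <- c("whale", "trout", "dolphin")

# One regex per category: keywords joined with "|" (alternation).
# Keywords are treated as regular expressions, so special characters
# in a keyword would need escaping.
cats <- list(mammals = mammals, swimming = swimming)
patterns <- lapply(cats, paste, collapse = "|")

# Logical matrix: rows are animals, columns are categories
hits <- do.call(cbind, lapply(patterns, grepl, x = df$animals))

# Join the names of all matching categories for each row
df$categories <- apply(hits, 1, function(h) toString(colnames(hits)[h]))
```

Here an animal without any match gets an empty string rather than NA, which is what the question's expected output shows.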