I have a data table with a specific set of genes in one column and a set of significant genes in another column. Both are character variables. How do I find the overlap between these two columns, row by row, and write it into a new column?
Example:
a <- c('apple banana melon pear ', 'pear kiwi pineapple', 'avocado lime kiwi apple', 'lime pineapple banana melon')
b <- c('blah blah blah banana pear', 'blah pear blah blah kiwi', 'blah blah blah apple', 'lime blah blah blah')
df <- data.frame(a,b)
What I want to return is df$new_column of c('banana pear', 'pear kiwi', 'apple', 'lime')
I have tried:
df$new_column <- df$a[df$a %in% df$b]
but I am getting the error message:
Error in `$<-.data.frame`(`*tmp*`, new_column, value = character(0)) :
  replacement has 0 rows, data has 4
CodePudding user response:
Those strings have to be separated into words first, then we can use intersect() on pairs of those sets.
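To illustrate why the original attempt comes back empty, here is a minimal sketch for the first row only (the names a1, b1, words_a, words_b and overlap are made up for this illustration). %in% compares whole strings, so no element of a ever equals an element of b, the subset is character(0), and assigning that to a 4-row data frame triggers the error. Once the strings are split into words, intersect() has something to compare:

```r
a1 <- "apple banana melon pear "
b1 <- "blah blah blah banana pear"

# %in% tests whether the whole string a1 occurs among the elements of b1
whole_match <- a1 %in% b1
whole_match

# split each string into its words, then intersect the two word sets
words_a <- strsplit(a1, " ")[[1]]
words_b <- strsplit(b1, " ")[[1]]
overlap <- intersect(words_a, words_b)

# paste the shared words back into a single string
overlap_string <- paste(overlap, collapse = " ")
overlap_string
```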
With base R perhaps something like this:
df <- data.frame(a,b)
# split strings and find intersections, paste back together
df$new_column <- mapply(\(a, b) paste(intersect(a, b), collapse = " "),
                        strsplit(df$a, " "),
                        strsplit(df$b, " "))
df
#>                             a                          b  new_column
#> 1    apple banana melon pear  blah blah blah banana pear banana pear
#> 2         pear kiwi pineapple   blah pear blah blah kiwi   pear kiwi
#> 3     avocado lime kiwi apple       blah blah blah apple       apple
#> 4 lime pineapple banana melon        lime blah blah blah        lime
# all values are just plain strings:
str(df)
#> 'data.frame': 4 obs. of 3 variables:
#> $ a : chr "apple banana melon pear " "pear kiwi pineapple" "avocado lime kiwi apple" "lime pineapple banana melon"
#> $ b : chr "blah blah blah banana pear" "blah pear blah blah kiwi" "blah blah blah apple" "lime blah blah blah"
#> $ new_column: chr "banana pear" "pear kiwi" "apple" "lime"
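If the real gene columns contain irregular spacing (leading or trailing blanks, double spaces), the same idea could be wrapped in a small function. This is only a sketch: row_overlap() and split_words() are hypothetical names, not functions from any package, and the sample vectors below are invented to show the whitespace handling.

```r
# row_overlap() is a hypothetical helper, not part of base R or any package
row_overlap <- function(x, y) {
  # trim the ends and split on runs of whitespace,
  # so stray spaces do not produce empty "words"
  split_words <- function(s) strsplit(trimws(s), "\\s+")
  mapply(
    function(wx, wy) paste(intersect(wx, wy), collapse = " "),
    split_words(x), split_words(y),
    USE.NAMES = FALSE
  )
}

# invented sample data with a trailing space and a double space
x <- c("apple banana melon pear ", "pear  kiwi pineapple", "melon")
y <- c("blah blah banana pear", "blah pear blah kiwi", "blah")
res <- row_overlap(x, y)
res
```

Rows without any shared word come back as an empty string "", so the result always has one element per row and can be assigned directly, e.g. df$new_column <- row_overlap(df$a, df$b).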
Alternatively:
library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(purrr)
# with Tidyverse and list columns:
df_lc <- df %>%
  mutate(across(c(a, b), ~ str_split(.x, " "))) %>%
  mutate(new_col = map2(a, b, ~ intersect(.x, .y)))
# now we have list columns:
df_lc["new_col"]
#>        new_col
#> 1 banana, pear
#> 2   pear, kiwi
#> 3        apple
#> 4         lime
# when printing a tibble it's a bit more evident:
as_tibble(df_lc)
#> # A tibble: 4 × 4
#>   a         b         new_column  new_col
#>   <list>    <list>    <chr>       <list>
#> 1 <chr [5]> <chr [5]> banana pear <chr [2]>
#> 2 <chr [3]> <chr [5]> pear kiwi   <chr [2]>
#> 3 <chr [4]> <chr [4]> apple       <chr [1]>
#> 4 <chr [4]> <chr [4]> lime        <chr [1]>
str(df_lc)
#> 'data.frame': 4 obs. of 4 variables:
#> $ a :List of 4
#> ..$ : chr "apple" "banana" "melon" "pear" ...
#> ..$ : chr "pear" "kiwi" "pineapple"
#> ..$ : chr "avocado" "lime" "kiwi" "apple"
#> ..$ : chr "lime" "pineapple" "banana" "melon"
#> $ b :List of 4
#> ..$ : chr "blah" "blah" "blah" "banana" ...
#> ..$ : chr "blah" "pear" "blah" "blah" ...
#> ..$ : chr "blah" "blah" "blah" "apple"
#> ..$ : chr "lime" "blah" "blah" "blah"
#> $ new_column: chr "banana pear" "pear kiwi" "apple" "lime"
#> $ new_col :List of 4
#> ..$ : chr "banana" "pear"
#> ..$ : chr "pear" "kiwi"
#> ..$ : chr "apple"
#> ..$ : chr "lime"
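If a plain character column is needed after working with the list column, the list elements can be collapsed again. A minimal base R sketch, assuming a list shaped like new_col above (the list is rebuilt by hand here so the snippet runs on its own; collapsed is a made-up name):

```r
# a hand-built stand-in for the new_col list column shown above
new_col <- list(c("banana", "pear"), c("pear", "kiwi"), "apple", "lime")

# paste each element's words into one string;
# vapply guarantees exactly one string per row
collapsed <- vapply(new_col, paste, character(1), collapse = " ")
collapsed
```

With the data frame from above this would be used as df_lc$new_col_chr <- vapply(df_lc$new_col, paste, character(1), collapse = " "), where new_col_chr is just an example column name.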
Input:
a <- c('apple banana melon pear ', 'pear kiwi pineapple', 'avocado lime kiwi apple', 'lime pineapple banana melon')
b <- c('blah blah blah banana pear', 'blah pear blah blah kiwi', 'blah blah blah apple', 'lime blah blah blah')
Created on 2023-01-20 with reprex v2.0.2