Home > Software engineering >  Create new columns in a dataframe from another with multiple options
Create new columns in a dataframe from another with multiple options

Time:11-26

a dataframe as list:

dfcheck <- data.frame(status = c("open/close", "close", "open"), stock = c("company energy","goods and books","other"), name = c("amazon1;google1","google3;yahoo1","yahoo2;amazon2;google2"))

And an input dataframe like this:

dfdata <- data.frame(id = c("id1", "id2", "id3"), title1 = c("amazon1","google1","yahoo1"), title2 = c("yahoo2",NA,"amazon2"))

How is it possible to produce a dataframe with columns based the previous list:

Expected output: dfdata <- data.frame(id = c("id1", "id2", "id3"), title1 = c("amazon1","google1","yahoo1"), title2 = c("yahoo2",NA,"amazon2"), status1 = c("open/close","open/close","close"), stock1 = c("company energy","company energy","goods and books"), status2 = c("open",NA,"open"), stock2 = c("other",NA,"other"))

 id  title1  title2    status1          stock1 status2
1 id1 amazon1  yahoo2 open/close  company energy    open
2 id2 google1    <NA> open/close  company energy    <NA>
3 id3  yahoo1 amazon2      close goods and books    open
  stock2
1  other
2   <NA>
3  other

This dataframe checks in dfdata in every column, expect the first id column, if any of the values in dfcheck dataframe exist and creates two new columns with the status and stock of dfcheck. From the dfcheck the column name has more than one values separated by ";"

CodePudding user response:

Libraries:

library(dplyr)
library(stringr)
library(tidyr)

First you'd need to tidy your dfcheck data.frame:

dfcheck_tidy <- dfcheck %>%
  mutate(name = str_split(name, ";")) %>%
  unnest(name)

(I'm not using tidyr::separate to do this as it seems from your example that you can have a variable length of names separated by ";".)

And now you can perform the two joins:

dfdata %>%
  left_join(dfcheck_tidy,
            by = c("title1" = "name")) %>%
  left_join(dfcheck_tidy,
            by = c("title2" = "name"),
            suffix = c("1", "2"))
#    id  title1  title2    status1          stock1 status2 stock2
# 1 id1 amazon1  yahoo2 open/close  company energy    open  other
# 2 id2 google1    <NA> open/close  company energy    <NA>   <NA>
# 3 id3  yahoo1 amazon2      close goods and books    open  other

CodePudding user response:

Here's an approach using regex_join() from the fuzzyjoin package.

library(dplyr)
library(fuzzyjoin)
regex_right_join(dfcheck, dfdata, by = c(name = "title1")) %>% 
  regex_right_join(dfcheck, ., by = c(name = "title2")) %>% 
  select(!contains("name")) %>% 
  relocate(id, title1, title2) 
   id  title1  title2 status.x stock.x   status.y         stock.y
1 id1 amazon1  yahoo2     open   other open/close  company energy
2 id2 google1    <NA>     <NA>    <NA> open/close  company energy
3 id3  yahoo1 amazon2     open   other      close goods and books
  •  Tags:  
  • r
  • Related