Combining two dataframes based on presence of values of various different columns

Time:05-13

I want to add columns to a dataframe by checking, row by row, whether any of several columns contains a value listed in a second dataframe and, if so, attaching that second dataframe's columns to the row. Since that description is rather abstract, here is an example dataset:

newDf <- data.frame(c("Juice 1", "Juice 2", "Juice 3", "Juice 4","Juice 5"),
                    c("Banana", "Banana", "Orange", "Pear", "Apple"),
                    c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
                    c("Kale", NA, "Cherry", NA, "Peach"))
colnames(newDf) <- c("Juice", "Fruit 1", "Fruit 2", "Fruit 3")


dfChecklist <- data.frame(c("Banana", "Cherry"),
                          c("100", "80"),
                          c("5", "3"),
                          c("4", "5"))
colnames(dfChecklist) <- c("FruitCheck", "NutritionalValue", "Deliciousness", "Difficulty")

This gives the following dataframes:

    Juice Fruit 1   Fruit 2 Fruit 3
1 Juice 1  Banana Blueberry    Kale
2 Juice 2  Banana     Mango    <NA>
3 Juice 3  Orange  Rasberry  Cherry
4 Juice 4    Pear   Spinach    <NA>
5 Juice 5   Apple      Pear   Peach


  FruitCheck NutritionalValue Deliciousness Difficulty
1     Banana              100             5          4
2     Cherry               80             3          5

I want to combine the two so that the result looks like this:

    Juice Fruit 1   Fruit 2 Fruit 3 FruitCheck NutritionalValue Deliciousness Difficulty
1 Juice 1  Banana Blueberry    Kale     Banana              100             5          4
2 Juice 2  Banana     Mango    <NA>     Banana              100             5          4
3 Juice 3  Orange  Rasberry  Cherry     Cherry               80             3          5
4 Juice 4    Pear   Spinach    <NA>       <NA>             <NA>          <NA>       <NA>
5 Juice 5   Apple      Pear   Peach       <NA>             <NA>          <NA>       <NA>

The dataset above is only an example; my own dataset is much larger and more complex.

Thanks so much in advance for your help!

CodePudding user response:

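First, for each row, find the fruit that appears in the checklist (if several fruits in a row match, the one listed earliest in dfChecklist is chosen):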

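tmp <- unlist(
  apply(
    # look only at the "Fruit ..." columns, one row at a time
    newDf[, grepl("Fruit", colnames(newDf))],
    1,
    function(x) {
      y <- as.vector(x)
      # match() gives each fruit's position in the checklist (NA if absent);
      # which.min() skips NAs and picks the earliest checklist entry
      y <- y[which.min(match(y, dfChecklist$FruitCheck))]
      # no match leaves a zero-length vector, so return NA instead
      if (length(y) == 0) NA_character_ else y
    }
  )
)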

Add this to your original dataframe as a new column, then do a simple left merge:

newDf$FruitCheck <- tmp

merge(
  newDf,
  dfChecklist,
  by = "FruitCheck",
  all.x = TRUE  # left join: keep juices that have no checklist match
)

resulting in

  FruitCheck   Juice Fruit 1   Fruit 2 Fruit 3 NutritionalValue Deliciousness Difficulty
1     Banana Juice 1  Banana Blueberry    Kale              100             5          4
2     Banana Juice 2  Banana     Mango    <NA>              100             5          4
3     Cherry Juice 3  Orange  Rasberry  Cherry               80             3          5
4       <NA> Juice 4    Pear   Spinach    <NA>             <NA>          <NA>       <NA>
5       <NA> Juice 5   Apple      Pear   Peach             <NA>          <NA>       <NA>
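
Note that merge() moves the key column to the front and sorts the rows by it. If the column order from the question (Juice first, checklist columns last) and the original row order matter, one possible variant is to look up the checklist row directly with match() and bind the columns on. This is only a sketch under the assumptions of the example data; the names fruitCols, idx and result are introduced here for illustration and are not part of the original answer. As in the code above, when several fruits in a row match, the earliest checklist entry wins.

```r
# Example data from the question, rebuilt so the block runs on its own
newDf <- data.frame(
  Juice     = c("Juice 1", "Juice 2", "Juice 3", "Juice 4", "Juice 5"),
  `Fruit 1` = c("Banana", "Banana", "Orange", "Pear", "Apple"),
  `Fruit 2` = c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
  `Fruit 3` = c("Kale", NA, "Cherry", NA, "Peach"),
  check.names = FALSE, stringsAsFactors = FALSE
)
dfChecklist <- data.frame(
  FruitCheck       = c("Banana", "Cherry"),
  NutritionalValue = c("100", "80"),
  Deliciousness    = c("5", "3"),
  Difficulty       = c("4", "5"),
  stringsAsFactors = FALSE
)

# Columns to search ("Fruit 1", "Fruit 2", ...)
fruitCols <- grep("^Fruit ", colnames(newDf))

# For each row: the checklist row number of the matching fruit, or NA
idx <- apply(newDf[, fruitCols], 1, function(x) {
  pos <- match(x, dfChecklist$FruitCheck)
  if (all(is.na(pos))) NA_integer_ else min(pos, na.rm = TRUE)
})

# Indexing with NA yields an all-NA row, so unmatched juices are kept
result <- cbind(newDf, dfChecklist[idx, , drop = FALSE])
rownames(result) <- NULL

print(result)
```

Because nothing is sorted or re-keyed, the rows stay in the order of newDf and the checklist columns are appended on the right.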