Home > Software design >  How to add information from Data Frame #1 to Data Frame #2 if #1 contains info not found in #2, but
How to add information from Data Frame #1 to Data Frame #2 if #1 contains info not found in #2, but

Time:01-30

I have a dataframe of bird nest boxes from 2021 and one from 2020. The one from 2021 has fewer fields, and there are NA values for boxes within the dataframe from 2021 that are filled out in the dataframe from 2020. However, in some instances, information for a box differs between the dataframes in the fields that they share (ie bf2 is "plastic" in 2020 and "wooden" in 2021). I want to take all information that is filled out in 2021 as correct, and to this dataframe add:

  1. the boxes only listed in dataframe 2020 (rf3, rf4, rf6, rf7) with all corresponding fields of information filled out
  2. Append the columns in 2020 but not in 2021 (box.age and land.water) to boxes listed in 2021 (bf1, bf2, rf1).
  3. Fill in any missing information that 2020 has but 2021 does not (bf8 boxes.per.post, bf9 habitat.type and box.material)
#Sample Code
box.id <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9")
boxes.per.post <- c("single", "single", "single", "single", "duplex", "single", NA, "duplex")
habitat.type <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", NA)
box.material <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA)

Data2021 <- data.frame(box.id, boxes.per.post, habitat.type, box.material)

box.id.1 <- c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9")
boxes.per.post.1 <- c("single", "single", "single", "single", "duplex", "single", "duplex", "single", "single")
habitat.type.1 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",  
                    "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond")
box.material <- c("wooden", "plastic", "plastic", "wooden","wooden","wooden","wooden", "plastic", "wooden")
box.age <- c(1,3,4,5,6,3,7,8, 12)
land.water <- c("land", "water", "land", "water", "water", "water", "water", "land", "water")

Data2020 <- data.frame(box.id.1, boxes.per.post.1, habitat.type.1, box.material, box.age, land.water)

colnames(Data2020) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")

#End Sample Code

I tried the following:

#Start attempts
cbind(Data2021, Data2020)                                                         #error... different number of rows

Data.merged <- left_join(Data2021, Data2020, by = "box.id")                       #doesn't add rf3-rf7 

Data.merged.1 <- left_join(Data2021, Data2020, 
                           by = c("box.id", "boxes.per.post", "habitat.type", "box.material"))  #only appends rows that match completely

Data.merged.2 <- full_join(Data2021, Data2020, by = "box.id")         
#creates separate columns for matching fields (ie boxes.per.post.x and boxes.per.post.y)


Data.merged.3 <- merge(Data2021, Data2020, by = "box.id", all.x = TRUE, all.y = TRUE) #creates extra columns again
#End attempts

The main issue I am running into is the instances in which information does not agree for a single box between data sheets. I imagine this will take several steps to complete, I'm just struggling putting the necessary steps together.

Here is my expected result:

box.id.2 <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9", "rf3", "rf4", "rf6", "rf7")
boxes.per.post.2 <- c("single", "single", "single", "single", "duplex", "single", "single", "duplex", "single", "duplex",
                      "single", "duplex")
habitat.type.2 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", "Pond",
                    "Ephemeral Wetland", "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek")
box.material.2 <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", "wooden", "wooden", "wooden",
                    "wooden","wooden")
box.age.2 <- c(1,3,NA, NA,4,NA,8,12,5,6,3,7)
land.water.2 <- c("land", "water", NA, NA, "land", NA, "land", "water", "water", "water", "water", "water")

Data.expected <- data.frame(box.id.2, boxes.per.post.2, habitat.type.2, box.material.2, box.age.2, land.water.2)
colnames(Data.expected) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")

CodePudding user response:

Your Data.expected has a different number of rows as both of your input sets, but I think what you need is a coalesce (on all duplicate columns) after your merge.

Try this:

left_join(Data2020, Data2021, by = "box.id") %>%
  mutate(across(ends_with(".x"), ~ coalesce(cur_data()[[ sub("\\.x$", ".y", cur_column()) ]], .))) %>%
  select(-ends_with(".y")) %>%
  rename_with(.fn = ~ sub("\\.x$", "", .))
#   box.id boxes.per.post      habitat.type box.material box.age land.water
# 1    bf1         single Ephemeral Wetland       wooden       1       land
# 2    bf2         single       River/Creek       wooden       3      water
# 3    rf1         duplex              Pond       wooden       4       land
# 4    rf3         single Ephemeral Wetland       wooden       5      water
# 5    rf4         duplex Ephemeral Wetland       wooden       6      water
# 6    rf6         single Ephemeral Wetland       wooden       3      water
# 7    rf7         duplex       River/Creek       wooden       7      water
# 8    bf8         single              Pond      plastic       8       land
# 9    bf9         duplex              Pond       wooden      12      water

coalesce is a function that returns the first values in its vectors that are non-NA. It is vectorized, and accepts one or more vectors, so the following demonstrate some of its functionality:

coalesce(1, NA) # no change
# [1] 1
coalesce(NA, 1)
# [1] 1
coalesce(2, NA, 1) # no change
# [1] 2
coalesce(NA, NA, 1)
# [1] 1

In each case, it's using the first non-NA value vector-wise.

Ways that this might need to be adjusted, based on things I don't know:

  • full_join instead of left_join, this might account for the additional columns in your Data.expected
  • Wrap the order of what is being coalesced. As it stands here, I'm prioritizing non-NA values in RHS (Data2021, joined-columns named *.x) over the LHS (.y), that was an arbitrary choice.

across is a good way to iterate the same function/process over multiple columns. Since it only works on one column at a time (the internal expression/code only sees a vector of values), we can use the special functions of cur_column() (for example, it might return the string "land.water.x"), sub to remove the ending literal ".y", and cur_data()[[ ... ]] to retrieve the other column of the c("", ".y") pair. (There are certainly other tricks to throw at this, including the use of .names="..." in across, but I didn't feel they added functionality here.)

CodePudding user response:

You may also try:

merge <- full_join(Data2021, Data2020) |> 
    group_by(box.id) |> 
    summarise(across(everything(), max, na.rm=TRUE)) |> 
    mutate(box.age = na_if(box.age, -Inf)
  • Related