I have a dataframe of bird nest boxes from 2021 and one from 2020. The one from 2021 has fewer fields, and there are NA values for boxes within the dataframe from 2021 that are filled out in the dataframe from 2020. However, in some instances, information for a box differs between the dataframes in the fields that they share (ie bf2 is "plastic" in 2020 and "wooden" in 2021). I want to take all information that is filled out in 2021 as correct, and to this dataframe add:
- the boxes only listed in dataframe 2020 (rf3, rf4, rf6, rf7) with all corresponding fields of information filled out
- Append the columns in 2020 but not in 2021 (box.age and land.water) to boxes listed in 2021 (bf1, bf2, rf1).
- Fill in any missing information that 2020 has but 2021 does not (bf8 boxes.per.post, bf9 habitat.type and box.material)
#Sample Code
box.id <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9")
boxes.per.post <- c("single", "single", "single", "single", "duplex", "single", NA, "duplex")
habitat.type <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", NA)
box.material <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", NA)
Data2021 <- data.frame(box.id, boxes.per.post, habitat.type, box.material)
box.id.1 <- c("bf1", "bf2", "rf1", "rf3", "rf4", "rf6", "rf7", "bf8", "bf9")
boxes.per.post.1 <- c("single", "single", "single", "single", "duplex", "single", "duplex", "single", "single")
habitat.type.1 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Ephemeral Wetland",
"Ephemeral Wetland", "Ephemeral Wetland", "River/Creek", "Pond", "Pond")
box.material <- c("wooden", "plastic", "plastic", "wooden","wooden","wooden","wooden", "plastic", "wooden")
box.age <- c(1,3,4,5,6,3,7,8, 12)
land.water <- c("land", "water", "land", "water", "water", "water", "water", "land", "water")
Data2020 <- data.frame(box.id.1, boxes.per.post.1, habitat.type.1, box.material, box.age, land.water)
colnames(Data2020) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")
#End Sample Code
I tried the following:
#Start attempts
cbind(Data2021, Data2020) #error... different number of rows
Data.merged <- left_join(Data2021, Data2020, by = "box.id") #doesn't add rf3-rf7
Data.merged.1 <- left_join(Data2021, Data2020,
by = c("box.id", "boxes.per.post", "habitat.type", "box.material")) #only appends rows that match completely
Data.merged.2 <- full_join(Data2021, Data2020, by = "box.id")
#creates separate columns for matching fields (ie boxes.per.post.x and boxes.per.post.y)
Data.merged.3 <- merge(Data2021, Data2020, by = "box.id", all.x = TRUE, all.y = TRUE) #creates extra columns again
#End attempts
The main issue I am running into is the instances in which information does not agree for a single box between data sheets. I imagine this will take several steps to complete, I'm just struggling putting the necessary steps together.
Here is my expected result:
box.id.2 <- c("bf1" , "bf2", "bf3", "bf4", "rf1", "rf2", "bf8", "bf9", "rf3", "rf4", "rf6", "rf7")
boxes.per.post.2 <- c("single", "single", "single", "single", "duplex", "single", "single", "duplex", "single", "duplex",
"single", "duplex")
habitat.type.2 <- c("Ephemeral Wetland", "River/Creek", "Impoundment", "Pond", "Pond", "Ephemeral Wetland", "Pond", "Pond",
"Ephemeral Wetland", "Ephemeral Wetland", "Ephemeral Wetland", "River/Creek")
box.material.2 <- c("wooden", "wooden", "wooden", "wooden", "wooden", "wooden", "plastic", "wooden", "wooden", "wooden",
"wooden","wooden")
box.age.2 <- c(1,3,NA, NA,4,NA,8,12,5,6,3,7)
land.water.2 <- c("land", "water", NA, NA, "land", NA, "land", "water", "water", "water", "water", "water")
Data.expected <- data.frame(box.id.2, boxes.per.post.2, habitat.type.2, box.material.2, box.age.2, land.water.2)
colnames(Data.expected) <- c("box.id", "boxes.per.post", "habitat.type", "box.material", "box.age", "land.water")
CodePudding user response:
Your Data.expected
has a different number of rows as both of your input sets, but I think what you need is a coalesce
(on all duplicate columns) after your merge.
Try this:
left_join(Data2020, Data2021, by = "box.id") %>%
mutate(across(ends_with(".x"), ~ coalesce(cur_data()[[ sub("\\.x$", ".y", cur_column()) ]], .))) %>%
select(-ends_with(".y")) %>%
rename_with(.fn = ~ sub("\\.x$", "", .))
# box.id boxes.per.post habitat.type box.material box.age land.water
# 1 bf1 single Ephemeral Wetland wooden 1 land
# 2 bf2 single River/Creek wooden 3 water
# 3 rf1 duplex Pond wooden 4 land
# 4 rf3 single Ephemeral Wetland wooden 5 water
# 5 rf4 duplex Ephemeral Wetland wooden 6 water
# 6 rf6 single Ephemeral Wetland wooden 3 water
# 7 rf7 duplex River/Creek wooden 7 water
# 8 bf8 single Pond plastic 8 land
# 9 bf9 duplex Pond wooden 12 water
coalesce
is a function that returns the first values in its vectors that are non-NA
. It is vectorized, and accepts one or more vectors, so the following demonstrate some of its functionality:
coalesce(1, NA) # no change
# [1] 1
coalesce(NA, 1)
# [1] 1
coalesce(2, NA, 1) # no change
# [1] 2
coalesce(NA, NA, 1)
# [1] 1
In each case, it's using the first non-NA
value vector-wise.
Ways that this might need to be adjusted, based on things I don't know:
full_join
instead ofleft_join
, this might account for the additional columns in yourData.expected
- Wrap the order of what is being
coalesce
d. As it stands here, I'm prioritizing non-NA
values in RHS (Data2021
, joined-columns named*.x
) over the LHS (.y
), that was an arbitrary choice.
across
is a good way to iterate the same function/process over multiple columns. Since it only works on one column at a time (the internal expression/code only sees a vector of values), we can use the special functions of cur_column()
(for example, it might return the string "land.water.x"
), sub
to remove the ending literal ".y"
, and cur_data()[[ ... ]]
to retrieve the other column of the c("", ".y")
pair. (There are certainly other tricks to throw at this, including the use of .names="..."
in across
, but I didn't feel they added functionality here.)
CodePudding user response:
You may also try:
merge <- full_join(Data2021, Data2020) |>
group_by(box.id) |>
summarise(across(everything(), max, na.rm=TRUE)) |>
mutate(box.age = na_if(box.age, -Inf)