I have two dataframes that I'd like to conditionally merge.
df1:
Location `Sub Location` Date n
<chr> <chr> <chr> <int>
1 AREA 1 Bore Drain 2014-04-21 15
2 AREA 1 Bore Drain 2014-04-23 2
3 AREA 1 Bore Drain 2014-04-24 6
4 AREA 1 Bore Drain 2015-04-04 6
5 AREA 1 Bore Drain 2015-04-08 8
6 AREA 1 Bore Drain 2015-04-09 9
7 AREA 1 Bore Drain 2016-03-25 31
8 AREA 1 Large Dam 2016-03-26 7
9 AREA 1 Bore Drain 2016-04-01 2
10 AREA 1 Bore Drain 2016-04-02 6
and df2:
Location `Sub Location` StartDate EndDate Totals
<chr> <chr> <chr> <chr> <dbl>
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 Bore Drain 2014-04-21 2014-04-21 0
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 Bore Drain 2014-04-23 2014-04-24 0
6 AREA 1 Bore Drain 2015-04-03 2015-04-04 0
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 Bore Drain 2015-04-08 2015-04-09 0
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
What I'd like to do is check for each entry in df1, if Date
matches either StartDate
OR EndDate
, AND the location
and Sub Location
are the same. If this is the case, I'd like n
in df1 to be added to Totals
in df2
I've tried using ifelse()
or going through every entry in a for loop, but I haven't been able to get it working. Any suggestions are appreciated :)
CodePudding user response:
Merge twice, first matching Date
to StartDate
, then to EndDate
. For cases where StartDate == EndDate
, it will merge in the same row from df1
twice, so delete the second n
value. Then add the two merged-in n
columns to Totals
.
library(dplyr)
merged <- df2 %>%
left_join(df1, by = c("StartDate" = "Date", "Location", "SubLocation")) %>%
left_join(df1, by = c("EndDate" = "Date", "Location", "SubLocation")) %>%
mutate(
n.y = ifelse(StartDate == EndDate, NA, n.y),
Totals = rowSums(across(c(Totals, n.x, n.y)), na.rm = TRUE),
n.x = NULL,
n.y = NULL
)
merged
Output:
# A tibble: 10 × 5
Location SubLocation StartDate EndDate Totals
<chr> <chr> <date> <date> <dbl>
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 BoreDrain 2014-04-21 2014-04-21 15
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 BoreDrain 2014-04-23 2014-04-24 8
6 AREA 1 BoreDrain 2015-04-03 2015-04-04 6
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 BoreDrain 2015-04-08 2015-04-09 17
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
CodePudding user response:
Something like this
df2$Totals <- rowSums(t(apply(df2, 1, function(x)
ifelse(x["Location"] == df1$Location &
x["`Sub Location`"] == df1$"`Sub Location`" &
(x["StartDate"] == df1$Date | x["EndDate"] == df1$Date),
df1$n, 0))))
df2
Location `Sub Location` StartDate EndDate Totals
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 Bore Drain 2014-04-21 2014-04-21 15
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 Bore Drain 2014-04-23 2014-04-24 8
6 AREA 1 Bore Drain 2015-04-03 2015-04-04 6
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 Bore Drain 2015-04-08 2015-04-09 17
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
Data
df1 <- structure(list(Location = c("AREA 1", "AREA 1", "AREA 1", "AREA 1",
"AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1"),
``Sub Location`` = c("Bore Drain", "Bore Drain", "Bore Drain",
"Bore Drain", "Bore Drain", "Bore Drain", "Bore Drain", "Large Dam",
"Bore Drain", "Bore Drain"), Date = c("2014-04-21", "2014-04-23",
"2014-04-24", "2015-04-04", "2015-04-08", "2015-04-09", "2016-03-25",
"2016-03-26", "2016-04-01", "2016-04-02"), n = c(15L, 2L,
6L, 6L, 8L, 9L, 31L, 7L, 2L, 6L)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(Location = c("AREA 1", "AREA 1", "AREA 1", "AREA 1",
"AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1"),
``Sub Location`` = c("Homestead", "Bore Drain", "Homestead",
"Cottage", "Bore Drain", "Bore Drain", "Homestead", "Bore Drain",
"Cottage", "Homestead"), StartDate = c("2013-03-29", "2014-04-21",
"2014-04-17", "2014-04-21", "2014-04-23", "2015-04-03", "2015-04-03",
"2015-04-08", "2015-04-08", "2016-03-25"), EndDate = c("2013-03-30",
"2014-04-21", "2014-04-18", "2014-04-22", "2014-04-24", "2015-04-04",
"2015-04-04", "2015-04-09", "2015-04-09", "2016-03-25"),
Totals = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-10L), class = "data.frame")
CodePudding user response:
Personally I prefer case_when
. To me it is easier to understand. First full_join
the 2 dataframes, then use case_when
to the conditions you specified. Then distinct
to remove duplicates
df1 %>%
full_join(df2) %>%
mutate(n = as.numeric(n),
Totals = as.numeric(Totals)) %>%
mutate(Totals = case_when(
Date == StartDate | Date == EndDate ~ Totals n,
TRUE ~ Totals
)) %>%
select(-n, -Date) %>%
drop_na() %>%
distinct(Location,Sub.Location,StartDate,EndDate, .keep_all = TRUE)