Home > Software engineering >  sum values ​of a joined table between two dates with data table
sum values ​of a joined table between two dates with data table

Time:11-09

We want to complete a DT1 table, by retrieving the sum of scores contained in another DT2 table. By adding to DT1 counters A and B from DT2, making sure that date of DT1 is between the start and end date of DT2. How is this achievable with data table?

Initial data

Table DT1

Date        City   Squad
2022/1/5    NY     a
2022/1/6    NY     b
2022/1/9    LA     b
2022/1/7    NY     a

Table DT2

Date_start  Date_end   City  Squad   Counter_A   Counter_B
2022/1/1    2022/1/11   NY    a        1           1 
2022/1/2    2022/1/5    NY    a        2           3 
2022/1/1    2022/1/8    LA    b        2           1 
2022/1/1    2022/1/8    NY    b        1           3 

Expected result

Date        City   Squad   Counter_A   Counter_B
2022/1/5     NY       a      3          4 
2022/1/6     NY       b      1          3 
2022/1/9     LA       b      0          0 
2022/1/7     NY       a      1          1 

Init data code

require(data.table)

DT1 <- data.table(
  work_day = c("2022/1/5","2022/1/6","2022/1/9","2022/1/7"),
  city= c("NY", "NY","LA", "NY"),
  squad=c("a","b","b","a")
)

DT2 <- data.table(
  date_start = c("2022/1/1","2022/1/2","2022/1/1","2022/1/1"),
  date_end = c("2022/1/11","2022/1/5","2022/1/8","2022/1/8"),
  city= c("NY","NY", "LA", "NY"),
  squad=c("a","a","b","b"),
  count_A=c(1,2,2,1),
  count_B=c(1,3,1,3)
)

Code attempt

I want to do in data table something like the following code in dplyr:

if(DT1$city == DT2$city & DT1$squad == DT2$squad &
   DT1$date %in% seq(DT2$date_start,DT2$date_end))
{
   DT1$counter_A=DT2$counter_A
   DT1$counter_B=DT2$counter_B
} else {
    "Nothing"
}

CodePudding user response:

This is called a "non-equi join," in the sense that you want to match workday to the range of start-end in the other table. You should find lots of answers on stack overflow like this one: Non-equi join of dates using data table.

What's perhaps a little unusual is showing the third line when it is not matched in the other table. If that is in fact desired you could fix it by wrapping the result in another join. Here's a dplyr approach. (the CRAN version of dplyr does not yet support non-equi joins but looks like it will in the next update. We can get around that by doing a full join and filtering.)

# you'll want to fix these regardless, since you can't calculate on text dates
DT1$work_day = as.Date(DT1$work_day)
DT2$date_start = as.Date(DT2$date_start)
DT2$date_end = as.Date(DT2$date_end)


DT1 %>%
  left_join(DT1 %>%
    left_join(DT2, by = c("city", "squad")) %>%
    filter(work_day >= date_start, work_day <= date_end) %>%
    group_by(work_day, squad) %>%
    summarize(across(count_A:count_B, sum))) %>%
  mutate(across(count_A:count_B, ~coalesce(.x, 0)))

Result

`summarise()` has grouped output by 'work_day'. You can override using the `.groups`
argument.
Joining, by = c("work_day", "squad")
     work_day city squad count_A count_B
1: 2022-01-05   NY     a       3       4
2: 2022-01-06   NY     b       1       3
3: 2022-01-09   LA     b       0       0
4: 2022-01-07   NY     a       1       1

CodePudding user response:

Just as an addition to Jon's answer:

Provided you have changed the columns with date information to an appropriate date type like he suggested, you can do the following within data.table.

setkey(DT1, "city", "squad")
setkey(DT2, "city", "squad")
dt.result <- DT1[DT2]
dt.result[, flag:=1*(work_day >= date_start & work_day <= date_end)]
dt.result[, .(ct_A=sum(count_A*flag), ct_B=sum(count_B*flag)), by=c("work_day", "city", "squad")]

     work_day city squad ct_A ct_B
1: 2022-01-09   LA     b    0    0
2: 2022-01-05   NY     a    3    4
3: 2022-01-07   NY     a    1    1
4: 2022-01-06   NY     b    1    3

The "ugly" construction with the flag column is only necessary in order to list combinations with zero score.

  • Related