Home > Software engineering >  Left join by group and condition (`tidyverse` or `data.table`)
Left join by group and condition (`tidyverse` or `data.table`)

Time:08-31

I have a very large data frame that includes integer columns state and state_cyclen. Every row is a gameframe, while state describes the state a game is in at that frame and state_cyclen is coded to indicate n occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen I need to import several columns from other definitions data frames. Definition data frames store properties about state and their row ordering informs on the way these properties are cycled throughout the game (players encounter each game state many times).

A minimal example of the long data that should be left joined:

data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)

data 
#>    state state_cyclen
#> 1      1            1
#> 2      1            1
#> 3      2            1
#> 4      2            1
#> 5      3            1
#> 6      3            1
#> 7      1            2
#> 8      1            2
#> 9      2            2
#> 10     2            2
#> 11     3            2
#> 12     3            2
#> 13     2            3
#> 14     2            3
#> 15     3            3
#> 16     3            3
#> 17     3            3
#> 18     4            1
#> 19     4            1
#> 20     3            4
#> 21     3            4

Minimal example for definition data frames storing the ordering:

def_one <- data.frame(
  prop = letters[1:3],
  others = LETTERS[1:3]
)  

def_two <- data.frame(
  prop = letters[4:10],
  others = LETTERS[4:10]
) 

def_three <- data.frame(
  prop = letters[11:12],
  others = LETTERS[11:12]
) 

I have a solution written in base R that gives the desired output, but it's neither very readable, nor probably very efficient.

# Add empty columns
data$prop <- NA
data$others <- NA

# Function that recycles numeric vector bounded by a upper limit 
bounded_vec_recyc <- function(vec, n) if(n == 1) vec else (vec - 1) %% n   1

# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen 
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
  

vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen 
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]


vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen 
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]

data
#>    state state_cyclen prop others
#> 1      1            1    a      A
#> 2      1            1    a      A
#> 3      2            1    d      D
#> 4      2            1    d      D
#> 5      3            1    k      K
#> 6      3            1    k      K
#> 7      1            2    b      B
#> 8      1            2    b      B
#> 9      2            2    e      E
#> 10     2            2    e      E
#> 11     3            2    l      L
#> 12     3            2    l      L
#> 13     2            3    f      F
#> 14     2            3    f      F
#> 15     3            3    k      K
#> 16     3            3    k      K
#> 17     3            3    k      K
#> 18     4            1 <NA>   <NA>
#> 19     4            1 <NA>   <NA>
#> 20     3            4    l      L
#> 21     3            4    l      L

Created on 2022-08-30 with reprex v2.0.2

TLDR: As you can see, I am basically trying to merge one by one these definition data frames to the main data frame on corresponding state by recycling the rows of the definition data frame while retaining their order, using the state_cyclen column to keep track of occurrences of each state throughout the game.

Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).

P.S. Not sure if title is adequate for the operations I am doing, as I can imagine multiple ways of implementation. Edits on it are welcome.

CodePudding user response:

Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, modify the state_cyclen in data using modulo with that number to be within the lookup range, then join.

library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
  mutate(state = as.numeric(state))  %>%
  group_by(state) %>%
  mutate(state_cyclen_adj = row_number()) %>%
  ungroup()

data %>%
  left_join(def %>% count(state)) %>%
  # eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
  mutate(state_cyclen_adj = (state_cyclen - 1) %% n   1) %>%
  left_join(def)


Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
   state state_cyclen  n state_cyclen_adj prop others
1      1            1  3                1    a      A
2      1            1  3                1    a      A
3      2            1  7                1    d      D
4      2            1  7                1    d      D
5      3            1  2                1    k      K
6      3            1  2                1    k      K
7      1            2  3                2    b      B
8      1            2  3                2    b      B
9      2            2  7                2    e      E
10     2            2  7                2    e      E
11     3            2  2                2    l      L
12     3            2  2                2    l      L
13     2            3  7                3    f      F
14     2            3  7                3    f      F
15     3            3  2                1    k      K
16     3            3  2                1    k      K
17     3            3  2                1    k      K
18     4            1 NA               NA <NA>   <NA>
19     4            1 NA               NA <NA>   <NA>
20     3            4  2                2    l      L
21     3            4  2                2    l      L

CodePudding user response:

Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:

library(data.table)

dt <- rbind(setDT(def_one)[,state := 1],
            setDT(def_two)[,state := 2],
            setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]

data <- setDT(data)
data[dt[,.N,by = state],
     state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
     on = "state",
     by = .EACHI]

dt[data,on = c("state","state_cyclen")]
    prop others state state_cyclen
 1:    a      A     1            1
 2:    a      A     1            1
 3:    d      D     2            1
 4:    d      D     2            1
 5:    k      K     3            1
 6:    k      K     3            1
 7:    b      B     1            2
 8:    b      B     1            2
 9:    e      E     2            2
10:    e      E     2            2
11:    l      L     3            2
12:    l      L     3            2
13:    f      F     2            3
14:    f      F     2            3
15:    k      K     3            1
16:    k      K     3            1
17:    k      K     3            1
18: <NA>   <NA>     4            1
19: <NA>   <NA>     4            1
20:    l      L     3            2
21:    l      L     3            2
    prop others state state_cyclen

By step: I bind the def_one, def_two and def_three dataframes to create a data.table with the variable you need to merge

dt <- rbind(setDT(def_one)[,state := 1],
            setDT(def_two)[,state := 2],
            setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]

In case you want to merge a lot of dataframes, you can use rbindlist and a list of data.tables.

I then modify your state_cyclen in data to do the same recycling than you:

dt[,.N,by = state]

   state N
1:     1 3
2:     2 7
3:     3 2

gives the lengths you use to define your recycling.

data[dt[,.N,by = state],
     state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
     on = "state",
     by = .EACHI]

I use the by = .EACHI to modify the variable for each group during the merge, using the N variable from dt[,.N,by = state]

Then I just have to do the left join:

dt[data,on = c("state","state_cyclen")]
  • Related