Home > Net >  Joining two data frames by sequential column matching
Joining two data frames by sequential column matching

Time:09-28

Let's say I have two data frames like so:

test_1 = game_df = read.table(text = "id winner_name
1 Jon
2 Bob
3 Lucas
4 Marcus
5 Toad
6 Donkey", header = T)

test_2 = game_df = read.table(text = "id_1 id_2 loser_name
9 1 Henry
2 2 George
3 3 Bagel
4 4 Cat
5 5 Giraffe
7 6 Monkey", header = T)

What I want to do is to first is to left_join test_1 on the id = id_=1 match like so:

test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1'))

There will be certain NA matches (John, and Donkey)

  id winner_name loser_name
1  1         Jon       <NA>
2  2         Bob     George
3  3       Lucas      Bagel
4  4      Marcus        Cat
5  5        Toad    Giraffe
6  6      Donkey       <NA>

And then I'd like to use id_2 as the matching column to fill the NA's, so I currently do something like this:

test_1 %>% left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%
  left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2'))

  id winner_name loser_name.x loser_name.y
1  1         Jon         <NA>        Henry
2  2         Bob       George       George
3  3       Lucas        Bagel        Bagel
4  4      Marcus          Cat          Cat
5  5        Toad      Giraffe      Giraffe
6  6      Donkey         <NA>       Monkey

This seems to work, but it generates a bunch of duplicate columns with x and y suffixes. In my actual dataset, I have to go through this conditional matching approach through a ton of id matches, so it will generate a large number of duplicate columns, that I then have to manually deselect and rename.

The issue is that there are hundreds of columns in the actual test_2 data.frame (loser_name, loser_country, loser_elo, loser_record, loser_win_rate), etc., so I'd need to manually specify the name and the columns to coalesce for each one. Furthermore, because I do this sequential id matching with more than one id, I'll have loser_name.x, loser_name.y, loser_name.z, and I don't know in advance how many suffixes there will be for each column.

Is there a simpler approach to this?

CodePudding user response:

We can do a coalesce at the end

library(dplyr)
test_1 %>%
   left_join(test_2 %>% select(id_1, loser_name), by = c('id' = 'id_1')) %>%  
   left_join(test_2 %>% select(id_2, loser_name), by = c('id' = 'id_2')) %>%
   transmute(id, winner_name, loser_name = coalesce(loser_name.x, loser_name.y))

-output

   id winner_name loser_name
1  1         Jon      Henry
2  2         Bob     George
3  3       Lucas      Bagel
4  4      Marcus        Cat
5  5        Toad    Giraffe
6  6      Donkey     Monkey

CodePudding user response:

You can try joining test_1 on a molten (=long) version of test_2. Only workt if the order of the molten test_2 resembles the search-order of your id's.. Now you can have id_1, id_2, ..., id_100

library(data.table)
#make them data.tables
setDT(test_1)
setDT(test_2)

#join on molten set
test_1[melt(test_2, id.vars = "loser_name"), 
       loser_name := i.loser_name, 
       on = .(id = value)]

#    id winner_name loser_name
# 1:  1         Jon      Henry
# 2:  2         Bob     George
# 3:  3       Lucas      Bagel
# 4:  4      Marcus        Cat
# 5:  5        Toad    Giraffe
# 6:  6      Donkey     Monkey
  • Related