Merge dataframe in R without duplicates

Time:10-17

I have a problem in R which is similar to this one:

merge pandas dataframe with key duplicates

It shouldn't be that hard to do this in R, but I just can't find a solution.

Thank you very much for your help!

CodePudding user response:

Create a sequence column in each data frame that numbers the rows within each 'key', then do a full_join on both 'key' and that column:

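library(dplyr)
library(data.table)  # provides rowid()
df1 %>%
  # rn counts the rows within each key: 1, 2, 3, ...
  mutate(rn = rowid(key)) %>%
  full_join(df2 %>%
              mutate(rn = rowid(key)),
            by = c("key", "rn")) %>%
  select(-rn)  # drop the helper column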

-output

 key    A    B
1  K0   A0   B0
2  K1   A1   B1
3  K2   A2   B2
4  K2   A3   B3
5  K2   A4 <NA>
6  K3   A5   B4
7  K3 <NA>   B5
8  K4 <NA>   B6

data

df1 <- structure(list(key = c("K0", "K1", "K2", "K2", "K2", "K3"),
                      A = c("A0", "A1", "A2", "A3", "A4", "A5")),
                 class = "data.frame",
                 row.names = c("0", "1", "2", "3", "4", "5"))

df2 <- structure(list(key = c("K0", "K1", "K2", "K2", "K3", "K3", "K4"),
                      B = c("B0", "B1", "B2", "B3", "B4", "B5", "B6")),
                 class = "data.frame",
                 row.names = c("0", "1", "2", "3", "4", "5", "6"))
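
To see why the helper column matters, here is a minimal base R sketch (no packages). It is not part of the answer above: it swaps rowid() for ave() and full_join() for merge(), and the names naive and paired are only illustrative. A plain outer merge on 'key' alone pairs every K2 row of df1 with every K2 row of df2, while merging on 'key' plus the within-key counter pairs them one to one.

```r
df1 <- data.frame(key = c("K0", "K1", "K2", "K2", "K2", "K3"),
                  A   = c("A0", "A1", "A2", "A3", "A4", "A5"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key = c("K0", "K1", "K2", "K2", "K3", "K3", "K4"),
                  B   = c("B0", "B1", "B2", "B3", "B4", "B5", "B6"),
                  stringsAsFactors = FALSE)

# Outer merge on key only: duplicated keys are cross-joined (3 x 2 rows for K2).
naive <- merge(df1, df2, by = "key", all = TRUE)
nrow(naive)   # 11

# Number the rows within each key (the role rowid() plays above).
df1$rn <- ave(seq_along(df1$key), df1$key, FUN = seq_along)
df2$rn <- ave(seq_along(df2$key), df2$key, FUN = seq_along)

# Outer merge on key and counter, then sort and drop the helper column.
paired <- merge(df1, df2, by = c("key", "rn"), all = TRUE)
paired <- paired[order(paired$key, paired$rn), c("key", "A", "B")]
rownames(paired) <- NULL
nrow(paired)  # 8
```

For this data, paired should contain the same eight rows as the output shown above.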

CodePudding user response:

akrun's answer is fantastic (see also the comments), and I learned something new from it:

Most of all, rowid() from data.table, a convenience function that generates a unique row id within each group.

A dplyr-only solution needs two steps for this (group_by, then row_number):

library(dplyr)
df1 %>%
  group_by(key) %>%
  mutate(id = row_number()) %>%  # number the rows within each key
  full_join(df2 %>%
              group_by(key) %>%
              mutate(id = row_number()),
            by = c("key", "id")) %>%
  ungroup() %>%
  select(-id)  # drop the helper column

  key   A     B    
  <chr> <chr> <chr>
1 K0    A0    B0   
2 K1    A1    B1   
3 K2    A2    B2   
4 K2    A3    B3   
5 K2    A4    NA   
6 K3    A5    B4   
7 K3    NA    B5   
8 K4    NA    B6 
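
Both answers rely on a counter that restarts at 1 for each key. As a rough illustration, the base R sketch below builds such a counter with ave(); the key vector is made up for the example and deliberately unsorted. As far as I know, data.table's rowid(key) and dplyr's group_by(key) followed by row_number() give the same numbers for this input.

```r
# Hypothetical, unsorted keys to show that counting follows order of appearance.
key <- c("K2", "K0", "K2", "K3", "K2")

# For each key, replace the positions with 1, 2, 3, ... in order of appearance.
rn <- ave(seq_along(key), key, FUN = seq_along)
rn
# 1 1 2 1 3
```

The rows do not need to be sorted by key first: the third "K2" gets 3 even though other keys sit between the occurrences.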