Home > Back-end >  How to find common items in two data frames in R?
How to find common items in two data frames in R?

Time:12-23

Below is the sample code. Seems simple enough but what I find online makes this far too complex. I am simply wanting to create a third data frame that contains one column.. the areas that common to firstdf and seconddf.

 area1 <- c("001","005","007","009","011","013","015","017","019","021","023","025")
 Employment1 <- c(2,4,6,8,110,12,14,15,16,17,12,20)

 firstdf <- data.frame(area1,Employment1)

 area2 <- c("005","007","011","013","015","021","027","033")
 Employment2 <- c(100,101,102,103,104,111,321,522)

 seconddf <- data.frame(area2,Employment2)

 intended result
 thirddf =>
 

 area 
 005
 007
 011
 013
 015
 021

CodePudding user response:

We can use intersect:

data.frame(area = intersect(firstdf$area1, seconddf$area2))
#   area
# 1  005
# 2  007
# 3  011
# 4  013
# 5  015
# 6  021

though there's no need for the frame, it can easily be intersect(..) without the wrapping data.frame(.).

Candidly, though, it is common to want to preserve the other columns from either or both frames. Some thoughts:

### rows where `area` is in both
merge(firstdf, seconddf, by.x = "area1", by.y = "area2")
#   area1 Employment1 Employment2
# 1   005           4         100
# 2   007           6         101
# 3   011         110         102
# 4   013          12         103
# 5   015          14         104
# 6   021          17         111

### always keep first frame
merge(firstdf, seconddf, by.x = "area1", by.y = "area2", all.x = TRUE)
#    area1 Employment1 Employment2
# 1    001           2          NA
# 2    005           4         100
# 3    007           6         101
# 4    009           8          NA
# 5    011         110         102
# 6    013          12         103
# 7    015          14         104
# 8    017          15          NA
# 9    019          16          NA
# 10   021          17         111
# 11   023          12          NA
# 12   025          20          NA

### always keep second frame
merge(firstdf, seconddf, by.x = "area1", by.y = "area2", all.y = TRUE)
#   area1 Employment1 Employment2
# 1   005           4         100
# 2   007           6         101
# 3   011         110         102
# 4   013          12         103
# 5   015          14         104
# 6   021          17         111
# 7   027          NA         321
# 8   033          NA         522

### keep all rows, NAs abound
merge(firstdf, seconddf, by.x = "area1", by.y = "area2", all = TRUE)
#    area1 Employment1 Employment2
# 1    001           2          NA
# 2    005           4         100
# 3    007           6         101
# 4    009           8          NA
# 5    011         110         102
# 6    013          12         103
# 7    015          14         104
# 8    017          15          NA
# 9    019          16          NA
# 10   021          17         111
# 11   023          12          NA
# 12   025          20          NA
# 13   027          NA         321
# 14   033          NA         522

There are dplyr-variants of the merge(..) calls above that might be easier to read if you're already learning/using that dialect. The joining is over-kill if you never want all of the other columns, but I offer it in case that's the direction you're headed.

CodePudding user response:

library(dplyr)

tibble(intersect(firstdf$area1, seconddf$area2))

1 005                                       
2 007                                       
3 011                                       
4 013                                       
5 015                                       
6 021 
  • Related