Filtering on unique values doesn't work properly

Time:03-15

The Problem

I have two data frames, cube and hub, with columns called knb_nd and nd respectively:

|      cube$knb_nd    |
|---------------------|
|          01         |
|          02         |
|          05         |
|          05         |
|          NA         |
|          07         |

|      hub$nd         |
|---------------------|
|          01         |
|          02         |
|          02         |
|          01         |
|          NA         |

I want a subset of cube containing only the knb_nd values that are not present in hub$nd:

|      result$nd      |
|---------------------|
|          05         |
|          07         |

What I tried

I tried to filter in base R using the unique() function on the columns, but when I look up an ND from the result it still shows up in both data frames. I have the same issue with the dplyr version.

# base R version
cube[!c(unique(cube$knb_nd) %in% unique(hub$nd)),]

# dplyr version
cube %>% 
  filter(!c(knb_nd %in% unique(hub$nd)))

I know there is probably an easy and obvious way to do this, but I can't seem to think of it.

CodePudding user response:

Try:

library(dplyr)

# keep the rows of cube whose knb_nd has no match in hub$nd, then drop duplicates
result <- unique(anti_join(cube, hub, by = c("knb_nd" = "nd"))) %>% 
  rename(nd = knb_nd)
result
#   nd
# 1  5
# 3  7

CodePudding user response:

There is an issue in the base R version:

cube[!c(unique(cube$knb_nd) %in% unique(hub$nd)),]

unique(cube$knb_nd) can return a vector shorter than the original column, so the logical vector derived from it is also shorter than the number of rows in cube. The index then no longer lines up with the rows, which creates an incorrect subset. Instead, build the index from the full column:

cube[!cube$knb_nd %in% unique(hub$nd),]
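
To see the length mismatch concretely, here is a small base R sketch. It assumes the columns are stored as character strings, as the tables in the question suggest; the names bad_idx, keep and result are only illustrative.

```r
# Toy data mirroring the tables in the question (assumed to be character columns)
cube <- data.frame(knb_nd = c("01", "02", "05", "05", NA, "07"),
                   stringsAsFactors = FALSE)
hub  <- data.frame(nd = c("01", "02", "02", "01", NA),
                   stringsAsFactors = FALSE)

# Original attempt: the index is built from the unique values only
bad_idx <- !(unique(cube$knb_nd) %in% unique(hub$nd))
length(bad_idx)        # 5, one entry per unique value
nrow(cube)             # 6, one entry per row

# The short index is recycled and picks positions 3 and 5, so "07" is lost
cube$knb_nd[bad_idx]   # "05" NA

# Fix: build the index from the full column, then de-duplicate the result
keep   <- !(cube$knb_nd %in% hub$nd)
result <- unique(cube[keep, , drop = FALSE])
result$knb_nd          # "05" "07"
```

Two details are worth noting. %in% treats NA as matching NA, so the NA row of cube is dropped here only because hub$nd also contains an NA. And drop = FALSE keeps the result a data frame even though cube has a single column in this toy example.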