dplyr::filter function generates wrong results

Time:12-29

I'm using the dplyr::filter function to filter data on three variables: Sex, Patient.Age, and Country.where.Event.occurred. The first code section below produces the correct result, and the second produces a wrong one. As far as I can tell, both sections express the same condition, so I'm confused about why the results differ.

> data
# A tibble: 1,360 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 16 YR       US                          
 5 Female 16 YR       US                          
 6 Female 16 YR       US                          
 7 Female 17 YR       ES                          
 8 Female 17 YR       ES                          
 9 Female 17 YR       GB                          
10 Female 19 YR       CA                          
# … with 1,350 more rows

# unique combination of 3 variables
> key <- data %>% 
    distinct(Sex, Patient.Age,Country.where.Event.occurred)
> key
# A tibble: 399 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 17 YR       ES                          
 5 Female 17 YR       GB                          
 6 Female 19 YR       CA                          
 7 Female 19 YR       US                          
 8 Female 2 YR        US                          
 9 Female 26 YR       US                          
10 Female 28 YR       US                          
# … with 389 more rows

> data %>%
    filter(Sex == key[3,]$Sex,
           Patient.Age == key[3,]$Patient.Age,
           Country.where.Event.occurred == key[3,]$Country.where.Event.occurred)
# A tibble: 4 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US                          

> Sex <- key[3,]$Sex
> Sex
[1] "Female"
> Age <- key[3,]$Patient.Age
> Age
[1] "16 YR"
> Country <- key[3,]$Country.where.Event.occurred
> Country
[1] "US"
> data %>%
    filter(Sex == Sex,
           Patient.Age == Age,
           Country.where.Event.occurred == Country)
# A tibble: 7 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US                          
5 Male   16 YR       US                          
6 Male   16 YR       US                          
7 Male   16 YR       US         

Answer:

The problem in the second example is most likely the line filter(Sex == Sex, ...).

Sex on both the left and the right of == is interpreted as the Sex column in the dataset, because inside filter() column names take precedence over variables in your environment. A column always equals itself (apart from missing values), so that condition is TRUE for every row and filters nothing out. That is why the Male rows appear in the second result.

I think you intended the right-hand side to be "Female" (judging from your pattern with the other two variables).
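
As a minimal sketch of two possible fixes, the example below uses a small made-up tibble in place of your data (the rows are illustrative, not your real values). It shows the masked comparison, then the .env pronoun, which tells dplyr to look up Sex in the calling environment, and then a rename of the environment variable; target_sex is a hypothetical name chosen only for this example.

```r
library(dplyr)

# Toy stand-in for the question's tibble; values are illustrative only.
data <- tibble(
  Sex = c("Female", "Female", "Male", "Male"),
  Patient.Age = c("16 YR", "17 YR", "16 YR", "16 YR"),
  Country.where.Event.occurred = c("US", "ES", "US", "US")
)

Sex <- "Female"
Age <- "16 YR"
Country <- "US"

# Masked: both sides of Sex == Sex refer to the column, so sex is not filtered.
masked <- data %>%
  filter(Sex == Sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)

# Fix 1: .env$Sex forces the lookup in the environment, not the data frame.
fixed_env <- data %>%
  filter(Sex == .env$Sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)

# Fix 2: give the environment variable a name that is not a column name.
target_sex <- "Female"  # hypothetical name
fixed_renamed <- data %>%
  filter(Sex == target_sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)
```

Renaming is the simpler habit for interactive work; the .env pronoun is probably the safer choice inside functions, where you cannot know in advance which column names the data will contain.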


To understand this more deeply, I suggest reading the "Programming with dplyr" vignette a few times. At least for me, there is always a nugget or two that I learn or relearn each time. For your specific question, the "Data masking" section is the relevant one. It says:

The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:

  • env-variables are “programming” variables that live in an environment. They are usually created with <-.

  • data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created manipulating existing variables.

...

I think this blurring of the meaning of “variable” is a really nice feature...

Unfortunately, this benefit does not come for free...
