I'm using the dplyr::filter function to filter data based on 3 variables: Sex, Patient.Age, and Country.where.Event.occurred. The first code section generates correct results, and the second code section generates wrong results. However, both code sections contain the same expression from my point of view, so I'm confused about why the results are different.
> data
# A tibble: 1,360 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 12 YR US
2 Female 16 YR KW
3 Female 16 YR US
4 Female 16 YR US
5 Female 16 YR US
6 Female 16 YR US
7 Female 17 YR ES
8 Female 17 YR ES
9 Female 17 YR GB
10 Female 19 YR CA
# … with 1,350 more rows
# unique combination of 3 variables
> key <- data %>%
distinct(Sex, Patient.Age,Country.where.Event.occurred)
> key
# A tibble: 399 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 12 YR US
2 Female 16 YR KW
3 Female 16 YR US
4 Female 17 YR ES
5 Female 17 YR GB
6 Female 19 YR CA
7 Female 19 YR US
8 Female 2 YR US
9 Female 26 YR US
10 Female 28 YR US
# … with 389 more rows
> data %>%
filter(Sex == key[3,]$Sex,
Patient.Age == key[3,]$Patient.Age,
Country.where.Event.occurred == key[3,]$Country.where.Event.occurred)
# A tibble: 4 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 16 YR US
2 Female 16 YR US
3 Female 16 YR US
4 Female 16 YR US
> Sex <- key[3,]$Sex
> Sex
[1] "Female"
> Age <- key[3,]$Patient.Age
> Age
[1] "16 YR"
> Country <- key[3,]$Country.where.Event.occurred
> Country
[1] "US"
> data %>%
filter(Sex == Sex,
Patient.Age == Age,
Country.where.Event.occurred == Country)
# A tibble: 7 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 16 YR US
2 Female 16 YR US
3 Female 16 YR US
4 Female 16 YR US
5 Male 16 YR US
6 Male 16 YR US
7 Male 16 YR US
CodePudding user response:
The problem in the second example is most likely the line filter(Sex == Sex...).
The term Sex on both the left and the right side is interpreted as the Sex column within the dataset. A column always matches itself, so that condition is TRUE for every row with a non-missing Sex and filters nothing out. That is why the Male rows come through.
I think you intend the right-hand side to be "Female" (judging from your pattern with the other two variables), but because the data has a column named Sex, the Sex variable you created with <- is never consulted.
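One way to make the intent explicit is the .env pronoun, which dplyr makes available inside data-masking verbs such as filter. The sketch below uses a small made-up tibble (not your real data) just to show the difference; renaming the variable to something that is not a column name (say, target_sex) should work equally well.

```r
library(dplyr)

# Hypothetical miniature of the data, for illustration only
data <- tibble(
  Sex = c("Female", "Female", "Male", "Male"),
  Patient.Age = c("16 YR", "16 YR", "16 YR", "17 YR"),
  Country.where.Event.occurred = c("US", "US", "US", "US")
)

Sex <- "Female"
Age <- "16 YR"
Country <- "US"

# Ambiguous: both sides of Sex == Sex resolve to the column,
# so only the age and country conditions actually filter
ambiguous <- data %>%
  filter(Sex == Sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)

# Explicit: .env$Sex forces the lookup in the calling environment
explicit <- data %>%
  filter(Sex == .env$Sex,
         Patient.Age == .env$Age,
         Country.where.Event.occurred == .env$Country)

nrow(ambiguous)  # 3: two Female rows plus one Male row
nrow(explicit)   # 2: only the Female rows
```

Using .env on all three conditions is not strictly needed here, since Age and Country are not column names, but it protects the code if a column with one of those names is added later.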
To learn about this more deeply, I suggest reading the Programming with dplyr vignette a few times. At least for me, there is always a nugget or two I learn or relearn each time. For your specific question, the "Data masking" section is relevant:
The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:
env-variables are “programming” variables that live in an environment. They are usually created with <-.
data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created manipulating existing variables.
...
I think this blurring of the meaning of “variable” is a really nice feature...
Unfortunately, this benefit does not come for free...
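To see the two kinds of variable side by side, here is a minimal sketch on made-up toy data (df and the column names from_data, from_env, and age are hypothetical). It also suggests why your Age and Country conditions behaved as expected while Sex did not: a bare name is looked up in the data first, and only falls back to the environment when no column has that name.

```r
library(dplyr)

df <- tibble(Sex = c("Female", "Male"))  # hypothetical toy data

Sex <- "Female"  # env-variable that shares its name with a column
Age <- "16 YR"   # env-variable with no matching column

result <- df %>%
  mutate(
    from_data = Sex,       # bare name: the column is used
    from_env  = .env$Sex,  # .env pronoun: the env-variable is used
    age       = Age        # no column called Age, so the env-variable is used
  )

result
```

In the result, from_data should simply repeat the Sex column, while from_env and age hold the single env-variable value recycled to every row.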