I have been using the %in% operator for a long time, ever since I learned about it.
However, I still don't fully understand how it works. I thought I did, but I am never sure which operand should go on which side.
Here you have an example:
This is my dataframe:
df <- data.frame("col1"=c(1,2,3,4,30,21,320,123,4351,1234,3,0,43), "col2"=rep("something",13))
This is how it looks:
> df
col1 col2
1 1 something
2 2 something
3 3 something
4 4 something
5 30 something
6 21 something
7 320 something
8 123 something
9 4351 something
10 1234 something
11 3 something
12 0 something
13 43 something
Let's say I have a numerical vector:
myvector <- c(30,43,12,333334,14,4351,0,5,55,66)
And I want to check whether all (or some) of the numbers from my vector are in the previous dataframe. To do that, I always use %in%.
I thought of two approaches:
#common in both: 30, 4351, 0, 43
# are the numbers from df$col1 in my vector?
trial1 <- subset(df, df$col1 %in% myvector)
# are the numbers of the vector in df$col1?
trial2 <- subset(df, myvector %in% df$col1)
Both approaches make sense to me and they should give the same result. However, only the result from trial1 is okay.
> trial1
col1 col2
5 30 something
9 4351 something
12 0 something
13 43 something
What I don't understand is why the second way gives me some of the common numbers along with some that are not in the vector:
> trial2
col1 col2
1 1 something
2 2 something
6 21 something
7 320 something
11 3 something
12 0 something
Could someone explain to me how the %in% operator works and why the second way gives me the wrong result?
Thanks very much in advance
Regards
CodePudding user response:
An answer has already been given, but for a bit more detail, simply look at what %in% returns:
df$col1 %in% myvector
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
This one is correct: it has one value per row of df, so subsetting df keeps the rows where it is TRUE, i.e. rows 5, 9, 12 and 13.
versus
myvector %in% df$col1
# [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
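This one goes wrong because the result has one value per element of myvector, not one per row of df. Used to subset df, it tells R to keep rows 1, 2, 6 and 7. Since its length is only 10 while df has 13 rows, it is recycled: rows 11, 12 and 13 get TRUE, TRUE, FALSE again, so rows 11 and 12 end up in your subset as well.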
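To make the length mismatch and the recycling visible, here is a small sketch using the df and myvector from the question. The names in_df, in_vec, recycled, rows_in_vector and values_in_df are only illustrative, and rep_len() is used here just to imitate what logical indexing does with a mask that is too short. The rule of thumb it tries to show: x %in% table returns one value per element of x, so the mask should only be used to subset x itself.

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

# x %in% table gives one logical value per element of x (the left-hand side)
in_df  <- df$col1 %in% myvector   # one value per row of df
in_vec <- myvector %in% df$col1   # one value per element of myvector

length(in_df)    # 13
length(in_vec)   # 10

# A logical index shorter than nrow(df) is recycled when subsetting rows;
# rep_len() reproduces that recycling explicitly
recycled <- rep_len(in_vec, nrow(df))
which(recycled)  # 1 2 6 7 11 12, the rows that showed up in trial2

# Use each mask only on the object it was built from
rows_in_vector <- df[in_df, ]        # rows of df whose col1 is in myvector
values_in_df   <- myvector[in_vec]   # elements of myvector found in df$col1
```

So if the question is "which rows of df have a value in myvector", the left-hand side has to be df$col1. If the question is "which values of myvector occur in df", subset myvector, not df.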