I'm working with a tibble and need to subset it, keeping only the rows where the variable "code" matches one of a vector of chosen codes. I've only recently started using dplyr, so I'm trying to use tibbles and %>% for it.
For example, let code and df be:
v_code <-c(314480505000001, 314480505000002, 314480505000003,
314480505000004, 314480505000005, 314480505000006,
314480505000007, 314480505000008, 314480505000009,
314480505000010)
code<-tibble(code=v_code)
v_df <- c(314470605000018, 314470605000019, 314470605000020,
314470605000021, 314470605000022, 314480505000001,
314480505000002, 314480505000003, 314480505000004,
314480505000005, 314480505000006, 314480505000007,
314480505000008, 314480505000009, 314480505000010)
df <- tibble (v_df, da = 1:15)
df2 <- data.frame(v_df, da = 1:15)
So, if I try:
df2[which(df2$v_df %in% v_code),]
I get what I want, but using tibbles:
df %>% filter(v_df %in% code)
I get
# A tibble: 0 x 2
# ... with 2 variables: v_df <dbl>, da <int>
In the same way, typing which(tibble(v_df) %in% code) returns integer(0).
I tried converting code and v_df to characters, but it doesn't work either.
I would appreciate any help.
CodePudding user response:
library(tidyverse)
v_code <-c(314480505000001, 314480505000002, 314480505000003,
314480505000004, 314480505000005, 314480505000006,
314480505000007, 314480505000008, 314480505000009,
314480505000010)
code<-tibble(code=v_code)
v_df <- c(314470605000018, 314470605000019, 314470605000020,
314470605000021, 314470605000022, 314480505000001,
314480505000002, 314480505000003, 314480505000004,
314480505000005, 314480505000006, 314480505000007,
314480505000008, 314480505000009, 314480505000010)
df <- tibble (v_df, da = 1:15)
df2 <- data.frame(v_df, da = 1:15)
# Compare against the code column (a vector), not the whole tibble
df %>% filter(v_df %in% code$code)
output
# A tibble: 10 x 2
v_df da
<dbl> <int>
1 3.14e14 6
2 3.14e14 7
3 3.14e14 8
4 3.14e14 9
5 3.14e14 10
6 3.14e14 11
7 3.14e14 12
8 3.14e14 13
9 3.14e14 14
10 3.14e14 15
CodePudding user response:
I have tried to address the question here, so hopefully it helps you understand how %in%
works, and when and how it can be used.
Your example doesn't work because code is a tibble, not a vector. If you pass the vector v_code instead, it works. For example:
df %>% filter(v_df %in% v_code)
# # A tibble: 10 x 2
# v_df da
# <dbl> <int>
# 1 3.14e14 6
# 2 3.14e14 7
# 3 3.14e14 8
# 4 3.14e14 9
# 5 3.14e14 10
# 6 3.14e14 11
# 7 3.14e14 12
# 8 3.14e14 13
# 9 3.14e14 14
# 10 3.14e14 15
See ?`%in%`
for more details, but in short:
x %in% table
x = vector or NULL: the values to be matched. Long vectors are supported.
table = vector or NULL: the values to be matched against. Long vectors are not supported.
In this example, filter() evaluates v_df as a column of df, so x is an ordinary vector, which is then compared against the table vector you supply.
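As a rough illustration of the description above, base R documents %in% as a thin wrapper around match(). The sketch below uses small made-up vectors (the values of x and table are illustrative only) and needs no packages:

```r
# Illustrative values only; not taken from the question's data.
x     <- c(5, 2, 9)
table <- c(1, 2, 3, 4, 5)

# %in% returns one TRUE/FALSE per element of x
res_in <- x %in% table

# The same test spelled out with match(): position in table, or 0 if absent
res_match <- match(x, table, nomatch = 0L) > 0L

print(res_in)
print(identical(res_in, res_match))
```

The result always has the length of x, which is why it can be used directly as a filter() condition on the column supplied as x.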
This can be a little confusing, since the first argument can be a tibble column while the second argument has to be a vector. To see why a tibble wouldn't work, imagine that the code tibble had multiple columns: which column(s) would %in% use for its comparison?
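To make this concrete, here is a small sketch of what happens when a tibble is passed as table, along with a few ways to hand %in% a vector instead. The data are shortened stand-ins for the question's code and df (the values are made up), and semi_join() is shown only as one possible join-based alternative, not as something the question requires:

```r
library(dplyr)

# Shortened stand-ins for the question's objects (illustrative values)
code <- tibble(code = c(101, 102, 103))
df   <- tibble(v_df = c(99, 100, 101, 102, 103), da = 1:5)

# A tibble is a list of columns, so %in% compares against whole columns
# rather than individual values, and nothing matches
in_tibble <- df$v_df %in% code

# Extracting the column gives %in% the vector it expects
in_dollar <- df$v_df %in% code$code
in_pull   <- df$v_df %in% pull(code, code)

kept <- df %>% filter(v_df %in% code$code)

# Possible alternative: keep the rows of df that have a match in code
kept_join <- df %>% semi_join(code, by = c("v_df" = "code"))

print(in_tibble)
print(kept)
```

Both filter() with code$code and semi_join() keep the same rows here; the join form may be convenient when the codes already live in a tibble.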
However, both x and table can be columns if they are in the same tibble. For example, let's make a tibble with 2 columns which we want to compare:
tibble(
x = letters[1:10],
y = letters[c(1:5, 8:6, 9:10)]
) %>%
mutate(
match = x == y
) %>%
{. ->> my_tibb}
my_tibb
# # A tibble: 10 x 3
# x y match
# <chr> <chr> <lgl>
# 1 a a TRUE
# 2 b b TRUE
# 3 c c TRUE
# 4 d d TRUE
# 5 e e TRUE
# 6 f h FALSE
# 7 g g TRUE
# 8 h f FALSE
# 9 i i TRUE
# 10 j j TRUE
Now, we use %in% to see if x matches y. Note the rowwise step, which makes %in% compare x with y one row at a time. Of course this could be done using x == y or match == TRUE for this example, but this demonstrates how it still works.
my_tibb %>%
rowwise %>%
filter(
x %in% y
)
# # A tibble: 8 x 3
# # Rowwise:
# x y match
# <chr> <chr> <lgl>
# 1 a a TRUE
# 2 b b TRUE
# 3 c c TRUE
# 4 d d TRUE
# 5 e e TRUE
# 6 g g TRUE
# 7 i i TRUE
# 8 j j TRUE
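For comparison, the sketch below shows what rowwise changes. Without it, %in% checks each value of x against the entire y column, so every row is kept here because every letter appears somewhere in y. The tibble is rebuilt under the hypothetical name cmp so the block runs on its own:

```r
library(dplyr)

# Same x and y as above, rebuilt under a hypothetical name
cmp <- tibble(
  x = letters[1:10],
  y = letters[c(1:5, 8:6, 9:10)]
)

# Whole-column comparison: is each x found anywhere in y?
whole_col <- cmp %>% filter(x %in% y)

# Row-by-row comparison: is x found in this row's y?
by_row <- cmp %>% rowwise() %>% filter(x %in% y) %>% ungroup()

print(nrow(whole_col))
print(nrow(by_row))
```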
Alternatively, if table is a list-column whose elements are themselves vectors, %in% can still be used. In this example, we make a list-column z_list in which each row holds a vector of 6 random letters. It is also collapsed into a string (z_string) just so we can see which letters they are in the tibble console preview:
set.seed(3)
tibble(
x = letters[1:10]
) %>%
rowwise %>%
mutate(
z_list = list(runif(6, min = 1, max = 26) %>%
round %>%
letters[.]),
z_string = str_c(z_list, collapse = ', ')
) %>%
{. ->> my_tibb2}
my_tibb2
# # A tibble: 10 x 3
# # Rowwise:
# x z_list z_string
# <chr> <list> <chr>
# 1 a <chr [6]> e, u, k, i, p, p
# 2 b <chr [6]> d, h, o, q, n, n
# 3 c <chr [6]> n, o, w, v, d, s
# 4 d <chr [6]> w, h, g, a, d, c
# 5 e <chr [6]> g, u, p, x, o, t
# 6 f <chr [6]> j, j, e, l, g, i
# 7 g <chr [6]> w, f, o, f, h, u
# 8 h <chr [6]> e, o, k, h, b, d
# 9 i <chr [6]> i, u, g, f, w, z
# 10 j <chr [6]> v, x, m, g, d, h
Then we can use %in%
to see when x
is in the z_list
column:
my_tibb2 %>%
filter(
x %in% z_list
)
# # A tibble: 3 x 3
# # Rowwise:
# x z_list z_string
# <chr> <list> <chr>
# 1 d <chr [6]> w, h, g, a, d, c
# 2 h <chr [6]> e, o, k, h, b, d
# 3 i <chr [6]> i, u, g, f, w, z
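If you would rather not keep the tibble rowwise, one possible alternative is to walk over x and the list-column in parallel with purrr::map2_lgl(). This is only a sketch: the tibble lc and its values are made up for illustration rather than generated with runif() as above:

```r
library(dplyr)
library(purrr)

# Hypothetical data: each row of z_list holds a character vector
lc <- tibble(
  x = c("a", "b", "c"),
  z_list = list(c("x", "a"), c("y", "z"), c("c", "c", "q"))
)

# map2_lgl() pairs each x with its own z_list element and
# returns one TRUE/FALSE per row
hits <- lc %>% filter(map2_lgl(x, z_list, ~ .x %in% .y))

print(hits$x)
```

The result is an ordinary (non-rowwise) tibble, so later steps are not affected by row-by-row evaluation.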
x %in% z_string doesn't work because z_string is one character string of several letters (like a word), and %in% only matches whole elements, so a single-letter string (x) won't match it.
If you did want to see if a letter (x
) was in a word, you would have to split the word into separate letters using str_extract_all()
and make it into a list, like below.
my_tibb2 %>%
mutate(
word = str_replace_all(z_string, ', ', ''),
word_list = str_extract_all(word, boundary('character'))
) %>%
{. ->> my_tibb3}
# # A tibble: 10 x 5
# # Rowwise:
# x z_list z_string word word_list
# <chr> <list> <chr> <chr> <list>
# 1 a <chr [6]> e, u, k, i, p, p eukipp <chr [6]>
# 2 b <chr [6]> d, h, o, q, n, n dhoqnn <chr [6]>
# 3 c <chr [6]> n, o, w, v, d, s nowvds <chr [6]>
# 4 d <chr [6]> w, h, g, a, d, c whgadc <chr [6]>
# 5 e <chr [6]> g, u, p, x, o, t gupxot <chr [6]>
# 6 f <chr [6]> j, j, e, l, g, i jjelgi <chr [6]>
# 7 g <chr [6]> w, f, o, f, h, u wfofhu <chr [6]>
# 8 h <chr [6]> e, o, k, h, b, d eokhbd <chr [6]>
# 9 i <chr [6]> i, u, g, f, w, z iugfwz <chr [6]>
# 10 j <chr [6]> v, x, m, g, d, h vxmgdh <chr [6]>
Then, we can use filter()
as we did before:
my_tibb3 %>%
filter(
x %in% word_list
)
# # A tibble: 3 x 5
# # Rowwise:
# x z_list z_string word word_list
# <chr> <list> <chr> <chr> <list>
# 1 d <chr [6]> w, h, g, a, d, c whgadc <chr [6]>
# 2 h <chr [6]> e, o, k, h, b, d eokhbd <chr [6]>
# 3 i <chr [6]> i, u, g, f, w, z iugfwz <chr [6]>
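As a possible shortcut, stringr can also test for a letter inside a word directly, without building a list first. The sketch below assumes str_detect() with a fixed() pattern is acceptable for your case; the tibble word_tbl reuses three of the words above under a hypothetical name:

```r
library(dplyr)
library(stringr)

# Hypothetical tibble reusing three words from the preview above
word_tbl <- tibble(
  x = c("a", "d", "h"),
  word = c("eukipp", "whgadc", "eokhbd")
)

# str_detect() is vectorised over both string and pattern, so each x is
# looked up in the word on the same row; fixed() matches x literally
found <- word_tbl %>% filter(str_detect(word, fixed(x)))

print(found)
```

No rowwise step is needed here because the comparison is already made row by row.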