Why doesn't operator %in% work with tibbles?

I'm working with a tibble and need to subset it, keeping only the rows whose "code" variable matches a vector of chosen codes. I've only recently started using dplyr, so I'm trying to do this with tibbles and %>%.

For example, let code and df be:

library(tidyverse)

v_code <-c(314480505000001, 314480505000002, 314480505000003, 
          314480505000004, 314480505000005, 314480505000006, 
          314480505000007, 314480505000008, 314480505000009, 
          314480505000010)
code<-tibble(code=v_code)
v_df <- c(314470605000018, 314470605000019, 314470605000020, 
            314470605000021, 314470605000022, 314480505000001, 
            314480505000002, 314480505000003, 314480505000004, 
            314480505000005, 314480505000006, 314480505000007, 
            314480505000008, 314480505000009, 314480505000010)
df <- tibble (v_df, da = 1:15)
df2 <- data.frame(v_df, da = 1:15)

So, if I try:

df2[which(df2$v_df %in% v_code),]

I get what I want, but using tibbles:

df %>% filter(v_df %in% code)

I get an empty result:

# A tibble: 0 x 2
# ... with 2 variables: v_df <dbl>, da <int>

In the same way, typing which(tibble(v_df) %in% code) returns integer(0).

I tried converting code and v_df to character, but that doesn't work either.

I would appreciate any help.

CodePudding user response：

library(tidyverse)

v_code <-c(314480505000001, 314480505000002, 314480505000003, 
           314480505000004, 314480505000005, 314480505000006, 
           314480505000007, 314480505000008, 314480505000009, 
           314480505000010)
code<-tibble(code=v_code)
v_df <- c(314470605000018, 314470605000019, 314470605000020, 
          314470605000021, 314470605000022, 314480505000001, 
          314480505000002, 314480505000003, 314480505000004, 
          314480505000005, 314480505000006, 314480505000007, 
          314480505000008, 314480505000009, 314480505000010)
df <- tibble (v_df, da = 1:15)
df2 <- data.frame(v_df, da = 1:15)

# code is a tibble; %in% needs the column itself, so extract it with $
df %>% filter(v_df %in% code$code)

output

# A tibble: 10 x 2
      v_df    da
     <dbl> <int>
 1 3.14e14     6
 2 3.14e14     7
 3 3.14e14     8
 4 3.14e14     9
 5 3.14e14    10
 6 3.14e14    11
 7 3.14e14    12
 8 3.14e14    13
 9 3.14e14    14
10 3.14e14    15
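
Since code is already a tibble, a filtering join is another way to get the same rows without extracting the column first. The following is only a minimal sketch, assuming dplyr is loaded; the shortened codes and the names by_in and by_join are made up for illustration and are not from the question.

```r
library(dplyr)

# Made-up, shortened codes standing in for the question's data
code <- tibble(code = c(101, 102, 103))
df   <- tibble(v_df = c(201, 202, 101, 102, 103), da = 1:5)

# Option 1: extract the column as a plain vector, then use %in%
by_in <- df %>% filter(v_df %in% pull(code, code))

# Option 2: keep every row of df that has a match in code
by_join <- df %>% semi_join(code, by = c("v_df" = "code"))
```

Both approaches should keep the same rows; semi_join() may read more naturally when the lookup values already live in their own table.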

CodePudding user response：

I have tried to address the question here, so hopefully it helps you understand how %in% works, and when and how it can be used.

Your example doesn't work because code is a tibble, and %in% expects a vector on its right-hand side. If you pass the vector (v_code) instead, it works. For example:

df %>% filter(v_df %in% v_code)

# # A tibble: 10 x 2
#      v_df    da
#     <dbl> <int>
# 1 3.14e14     6
# 2 3.14e14     7
# 3 3.14e14     8
# 4 3.14e14     9
# 5 3.14e14    10
# 6 3.14e14    11
# 7 3.14e14    12
# 8 3.14e14    13
# 9 3.14e14    14
# 10 3.14e14   15

See ?`%in%` for more details, but in short:

x %in% table

x = vector or NULL: the values to be matched. Long vectors are supported
table = vector or NULL: the values to be matched against. Long vectors are not supported.
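
As a small illustration of the description above, %in% can be seen as a thin wrapper around match(). This sketch uses base R only; the values and the names lookup, pos, hit and hit_manual are invented for the example.

```r
x      <- c(5, 1, 9)
lookup <- c(1, 2, 3, 4, 5)

# Position of each element of x in lookup; NA where there is no match
pos <- match(x, lookup)

# %in% reports only whether a match exists
hit <- x %in% lookup

# Equivalent form, following the definition given in ?`%in%`
hit_manual <- match(x, lookup, nomatch = 0L) > 0L
```

Here 5 and 1 are found in lookup and 9 is not, so hit and hit_manual agree.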

When filtering using %in%, in this example you are supplying a column as x, which is treated as a vector and then compared to the table vector you supply.

I guess this could be a little confusing, seeing as you can supply a tibble column as the first argument, but the second argument has to be a vector. To make it more obvious why this wouldn't work, imagine the code tibble had multiple columns: which column(s) would %in% use for its comparison?
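
To make the difference between a whole tibble and a single column concrete, here is a minimal sketch with a hypothetical two-column lookup tibble (codes, with made-up values). It assumes the tibble package is installed.

```r
library(tibble)

v_df  <- c(10, 20, 30)
codes <- tibble(code = c(20, 30, 40), label = c("a", "b", "c"))

# Whole tibble on the right: values are compared with entire columns,
# not with the values inside them, so nothing matches here
whole <- v_df %in% codes

# One column extracted as a plain vector: element-wise matching
by_dollar  <- v_df %in% codes$code
by_bracket <- v_df %in% codes[["code"]]
```

Extracting the column with $ or [[ ]] removes the ambiguity about which column is meant.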

However, it looks like both x and table can be columns if they are in the same tibble. For example, let's make a tibble with 2 columns which we want to compare:

tibble(
  x = letters[1:10], 
  y = letters[c(1:5, 8:6, 9:10)]
  ) %>% 
  mutate(
    match = x == y
  ) %>% 
  {. ->> my_tibb}

my_tibb

# # A tibble: 10 x 3
#    x     y     match
#    <chr> <chr> <lgl>
# 1  a     a     TRUE 
# 2  b     b     TRUE 
# 3  c     c     TRUE 
# 4  d     d     TRUE 
# 5  e     e     TRUE 
# 6  f     h     FALSE
# 7  g     g     TRUE 
# 8  h     f     FALSE
# 9  i     i     TRUE 
# 10 j     j     TRUE

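Now we use %in% to check whether x matches y. Because of rowwise(), each row's x is compared only with the y in that same row; without it, x would be looked up in the whole y column. Of course, for this example the same result could be had with x == y or match == TRUE, but it shows that %in% works with two columns.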

my_tibb %>% 
  rowwise %>% 
  filter(
    x %in% y
  )

# # A tibble: 8 x 3
# # Rowwise: 
#   x     y     match
#   <chr> <chr> <lgl>
# 1 a     a     TRUE 
# 2 b     b     TRUE 
# 3 c     c     TRUE 
# 4 d     d     TRUE 
# 5 e     e     TRUE 
# 6 g     g     TRUE 
# 7 i     i     TRUE 
# 8 j     j     TRUE
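
The effect of rowwise() is easier to see on a smaller table. This is only a sketch with a hypothetical three-row tibble (letter_tbl), assuming dplyr is loaded.

```r
library(dplyr)

letter_tbl <- tibble(
  x = c("a", "b", "c"),
  y = c("a", "c", "b")
)

# Without rowwise(): each x is looked up in the whole y column,
# so every row is kept because a, b and c all occur somewhere in y
column_wise <- letter_tbl %>% filter(x %in% y)

# With rowwise(): each x is compared only with the y in the same row
row_wise <- letter_tbl %>%
  rowwise() %>%
  filter(x %in% y) %>%
  ungroup()
```

Only the first row survives the row-wise version, since it is the only one where x and y hold the same letter.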

Alternatively, if your table object is a list column whose elements are themselves vectors, %in% can still be used. In this example, we make a column z_list in which each row holds a vector of 6 random letters. It is also collapsed into a string (z_string) just so we can see the letters in the tibble console preview:

set.seed(3)

tibble(
  x = letters[1:10]
) %>% 
  rowwise %>%
  mutate(
    z_list = list(runif(6, min = 1, max = 26) %>%
               round %>%
               letters[.]),
    z_string = str_c(z_list, collapse = ', ')
  ) %>% 
  {. ->> my_tibb2}

my_tibb2

# # A tibble: 10 x 3
# # Rowwise: 
#    x     z_list    z_string        
#    <chr> <list>    <chr>           
# 1  a     <chr [6]> e, u, k, i, p, p
# 2  b     <chr [6]> d, h, o, q, n, n
# 3  c     <chr [6]> n, o, w, v, d, s
# 4  d     <chr [6]> w, h, g, a, d, c
# 5  e     <chr [6]> g, u, p, x, o, t
# 6  f     <chr [6]> j, j, e, l, g, i
# 7  g     <chr [6]> w, f, o, f, h, u
# 8  h     <chr [6]> e, o, k, h, b, d
# 9  i     <chr [6]> i, u, g, f, w, z
# 10 j     <chr [6]> v, x, m, g, d, h

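Then we can use %in% to see when x is in the z_list column (my_tibb2 is still rowwise, so each x is checked against its own row's vector):

my_tibb2 %>% 
  filter(
    x %in% z_list
  )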

# # A tibble: 3 x 3
# # Rowwise: 
#   x     z_list    z_string        
#   <chr> <list>    <chr>           
# 1 d     <chr [6]> w, h, g, a, d, c
# 2 h     <chr [6]> e, o, k, h, b, d
# 3 i     <chr [6]> i, u, g, f, w, z
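
If the tibble is not rowwise, the same per-row check can be written by iterating over the two columns in parallel. This is a sketch of one possible alternative, not the approach used above; it assumes dplyr and purrr are loaded, and the tibble tb with its values is made up.

```r
library(dplyr)
library(purrr)

tb <- tibble(
  x = c("a", "b", "c"),
  z_list = list(c("a", "q"), c("x", "y"), c("c", "c"))
)

# map2_lgl() walks x and z_list together and returns one TRUE/FALSE per row
kept <- tb %>% filter(map2_lgl(x, z_list, ~ .x %in% .y))
```

Rows "a" and "c" are kept because each letter appears in its own row's vector.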

x %in% z_string doesn't work because z_string is a character string of several letters (like a word), so a single-letter string (x) won't match it.
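
A quick base R sketch of this point, using the string from row 4 above; the variable names are chosen for illustration only.

```r
z_string <- "w, h, g, a, d, c"

# %in% compares whole elements, so a single letter never equals the full string
whole_match <- "d" %in% z_string

# Splitting the string back into letters gives %in% a vector to search
letters_vec <- strsplit(z_string, ", ", fixed = TRUE)[[1]]
split_match <- "d" %in% letters_vec

# Substring search is a different operation from set membership
substring_match <- grepl("d", z_string, fixed = TRUE)
```

whole_match is FALSE, while split_match and substring_match are both TRUE.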

If you did want to see if a letter (x) was in a word, you would have to split the word into separate letters using str_extract_all() and make it into a list, like below.

my_tibb2 %>%
  mutate(
    word = str_replace_all(z_string, ', ', ''), 
    word_list = str_extract_all(word, boundary('character'))
  ) %>% 
  {. ->> my_tibb3}

# # A tibble: 10 x 5
# # Rowwise: 
#     x     z_list    z_string         word   word_list
#     <chr> <list>    <chr>            <chr>  <list>   
#  1  a     <chr [6]> e, u, k, i, p, p eukipp <chr [6]>
#  2  b     <chr [6]> d, h, o, q, n, n dhoqnn <chr [6]>
#  3  c     <chr [6]> n, o, w, v, d, s nowvds <chr [6]>
#  4  d     <chr [6]> w, h, g, a, d, c whgadc <chr [6]>
#  5  e     <chr [6]> g, u, p, x, o, t gupxot <chr [6]>
#  6  f     <chr [6]> j, j, e, l, g, i jjelgi <chr [6]>
#  7  g     <chr [6]> w, f, o, f, h, u wfofhu <chr [6]>
#  8  h     <chr [6]> e, o, k, h, b, d eokhbd <chr [6]>
#  9  i     <chr [6]> i, u, g, f, w, z iugfwz <chr [6]>
# 10  j     <chr [6]> v, x, m, g, d, h vxmgdh <chr [6]>

Then, we can use filter() as we did before:

my_tibb3 %>% 
  filter(
    x %in% word_list
  )

# # A tibble: 3 x 5
# # Rowwise: 
#   x     z_list    z_string         word   word_list
#   <chr> <list>    <chr>            <chr>  <list>   
# 1 d     <chr [6]> w, h, g, a, d, c whgadc <chr [6]>
# 2 h     <chr [6]> e, o, k, h, b, d eokhbd <chr [6]>
# 3 i     <chr [6]> i, u, g, f, w, z iugfwz <chr [6]>