Selecting Entries in a Data Frame Stored in a List

I have the following datasets:

my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)

I defined an object as follows:

unique = unique(my_data_1[c("col1")])

I want to select all rows of "my_data" whose "col1" value appears in "unique":

output <- my_data[which(my_data$col1 %in% as.list(unique) ), ]

But this is returning an empty selection:

[1] col1 id  
<0 rows> (or 0-length row.names)

Is there another way to do this in R?

Thank you!

Note: It works if I type the values out by hand:

> as.list(unique)
$col1
[1] "abc" "byd" "bgg" "fef" "eee"

output <- my_data[which(my_data$col1 %in% c("abc", "byd", "bgg", "fef", "eee")), ]

But I am looking for a shortcut in which I don't have to manually type out everything.

CodePudding user response：

You are trying to subset one data.frame to rows that match the unique values of a column in another data.frame.

Your attempt returns no rows because unique is a data.frame, and as.list() turns it into a one-element list whose single element is the whole column. %in% compares each value of my_data$col1 against the elements of that list, not against the strings inside them, so nothing matches. When subsetting with foo[bar, ], bar should be a vector: either the indices of the rows to keep (e.g. foo[c(1, 2), ]) or one logical value per row of the data.frame. All you need to do is use %in% with the vector of unique values itself.
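A minimal sketch of why the list version matches nothing, using the example data from the question. The name unique_df is hypothetical and is used here only to avoid reusing the name of the unique() function:

```r
my_data   <- data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee"), id = 1:5)
my_data_1 <- data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee"), id = 1:5)

# unique_df is a hypothetical name for the object called `unique` in the question
unique_df <- unique(my_data_1["col1"])

# as.list() on a one-column data.frame gives a list of length 1:
# its only element is the entire column
as_list <- as.list(unique_df)
length(as_list)
#> [1] 1

# %in% looks at the list's elements, and the only element is the whole
# column, so no single string is equal to it
hits_list <- my_data$col1 %in% as_list
hits_list
#> [1] FALSE FALSE FALSE FALSE FALSE

# Pulling the column out as a character vector gives element-wise matching
hits_vec <- my_data$col1 %in% unique_df[["col1"]]
hits_vec
#> [1]  TRUE FALSE FALSE  TRUE  TRUE
```

unique_df$col1 or unlist(unique_df) would work in place of unique_df[["col1"]].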

You don't need as.list() for this, and which() is redundant here because a data.frame can be subset directly with a logical vector. %in% returns TRUE or FALSE for each row of my_data, and that result can be used as the row index. All which() does is convert the TRUE positions into row numbers before subsetting, which selects the same rows.
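To illustrate, the two forms can be compared side by side. This is only a sketch on the question's data; keep and lookup are hypothetical names:

```r
my_data <- data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee"), id = 1:5)
lookup  <- c("abc", "byd", "bgg", "fef", "eee")   # hypothetical name

keep <- my_data$col1 %in% lookup   # one TRUE/FALSE per row of my_data

which(keep)                        # the row numbers where keep is TRUE
#> [1] 1 4 5

by_logical <- my_data[keep, ]          # subset with the logical vector
by_index   <- my_data[which(keep), ]   # subset with the row numbers

by_logical$id
#> [1] 1 4 5
by_index$id
#> [1] 1 4 5
```

%in% never returns NA, so there are no missing values for which() to drop in this case.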

# Your example data
my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)
unique = unique(my_data_1[c("col1")])

# Show that unique is a data.frame
str(unique)
#> 'data.frame':    5 obs. of  1 variable:
#>  $ col1: chr  "abc" "byd" "bgg" "fef" ...

# Show that unique$col1 is a vector
str(unique$col1)
#>  chr [1:5] "abc" "byd" "bgg" "fef" "eee"

# Show what a logical test with the character vector does
my_data$col1 %in% unique$col1
#> [1]  TRUE FALSE FALSE  TRUE  TRUE

# We can use this to subset
my_data[my_data$col1 %in% unique$col1, ]
#>   col1 id
#> 1  abc  1
#> 4  eee  4
#> 5  eee  5

You could also combine steps and simply use:

my_data[my_data$col1 %in% unique(my_data_1$col1), ]
#>   col1 id
#> 1  abc  1
#> 4  eee  4
#> 5  eee  5

CodePudding user response：

Other ways:

Base R

merge(my_data, unique, by = "col1")
# or merge(unique, my_data, by = "col1")

#  col1 id
#1  abc  1
#2  eee  4
#3  eee  5

dplyr

library(dplyr)
inner_join(my_data, unique, by = "col1")
# or inner_join(unique, my_data, by = "col1")

#  col1 id
#1  abc  1
#2  eee  4
#3  eee  5
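
One difference from the %in% approach is worth keeping in mind: a join returns one row per matching pair, so a lookup table that still contains duplicates will repeat rows. A small base R sketch, where lookup_dup is a hypothetical table invented for this illustration:

```r
my_data    <- data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee"), id = 1:5)
lookup_dup <- data.frame(col1 = c("eee", "eee", "abc"))   # hypothetical, "eee" appears twice

# %in% keeps each row of my_data at most once
filtered <- my_data[my_data$col1 %in% lookup_dup$col1, ]
nrow(filtered)
#> [1] 3

# merge() pairs every "eee" row of my_data with every "eee" row of lookup_dup
joined <- merge(my_data, lookup_dup, by = "col1")
nrow(joined)
#> [1] 5

# De-duplicating the lookup first, as in the question, avoids the repeats
joined_unique <- merge(my_data, unique(lookup_dup), by = "col1")
nrow(joined_unique)
#> [1] 3
```

With the de-duplicated lookup used in the question, the join and the %in% filter select the same rows.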