I have the following datasets:
my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)
I defined an object as follows:
unique = unique(my_data_1[c("col1")])
I want to select all rows in "my_data" in which "col1" contains any value within "unique":
output <- my_data[which(my_data$col1 %in% as.list(unique) ), ]
But this is returning an empty selection:
[1] col1 id
<0 rows> (or 0-length row.names)
Is there another way to do this in R?
Thank you!
Note: Typing the values out by hand works:
> as.list(unique)
$col1
[1] "abc" "byd" "bgg" "fef" "eee"
output <- my_data[which(my_data$col1 %in% c("abc", "byd" ,"bgg", "fef", "eee") ), ]
But I am looking for a shortcut in which I don't have to manually type out everything.
CodePudding user response:
You are trying to subset one data.frame to the rows that match the unique values of a column in another data.frame.
Your attempted solution returns no rows because unique is a data.frame, and when you coerce it with as.list() you get a list with a single element (the whole col1 vector) instead of a vector of values. %in% then compares each value of my_data$col1 against that one element as a whole, so nothing matches. When subsetting using foo[bar, ], bar should be a vector containing either the indices of the rows to keep (e.g. foo[c(1,2), ]) or a logical value for each row of the data.frame. All you need to do is use %in% with the vector of unique values itself.
You don't need as.list() for this, and which() is redundant because you can subset the data.frame with a logical vector instead of row indices. %in% returns TRUE or FALSE for each row of my_data, and that result can be used to subset directly. All which() does is get the indices of the rows that are TRUE so that you subset by index, which adds an unnecessary step.
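As a minimal sketch of the two points above (why the list version matches nothing, and why which() changes nothing here), the comparison can be run side by side. The name lookup is my own choice, used only to avoid masking base::unique(); the data are the ones from the question.

```r
# Example data from the question
my_data   <- data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee"), id = 1:5)
my_data_1 <- data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee"), id = 1:5)

# `lookup` is a hypothetical name, chosen so base::unique() is not masked
lookup <- unique(my_data_1["col1"])   # still a one-column data.frame

# The list has a single element (the whole column), so no value matches it
in_list <- my_data$col1 %in% as.list(lookup)

# Against the character vector, matching happens value by value
in_vec <- my_data$col1 %in% lookup$col1

# which() only turns the logical vector into row indices; the rows are the same
same <- identical(my_data[which(in_vec), ], my_data[in_vec, ])
```

Here in_list should be FALSE for every row, while in_vec flags rows 1, 4 and 5. Because %in% never returns NA, dropping which() is safe in this case.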
# Your example data
my_data = data.frame(col1 = c("abc", "bcd", "bfg", "eee", "eee") , id = 1:5)
my_data_1 = data.frame(col1 = c("abc", "byd", "bgg", "fef", "eee") , id = 1:5)
unique = unique(my_data_1[c("col1")])
# Show that unique is a data.frame
str(unique)
#> 'data.frame': 5 obs. of 1 variable:
#> $ col1: chr "abc" "byd" "bgg" "fef" ...
# Show that unique$col1 is a vector
str(unique$col1)
#> chr [1:5] "abc" "byd" "bgg" "fef" "eee"
# Show what a logical test with the character vector does
my_data$col1 %in% unique$col1
#> [1] TRUE FALSE FALSE TRUE TRUE
# We can use this to subset
my_data[my_data$col1 %in% unique$col1, ]
#> col1 id
#> 1 abc 1
#> 4 eee 4
#> 5 eee 5
You could also combine steps and simply use:
my_data[my_data$col1 %in% unique(my_data_1$col1), ]
#> col1 id
#> 1 abc 1
#> 4 eee 4
#> 5 eee 5
CodePudding user response:
Other ways:
Base R
merge(my_data, unique, by = "col1")
#or merge(unique, my_data, by = "col1")
# col1 id
#1 abc 1
#2 eee 4
#3 eee 5
dplyr
library(dplyr)
inner_join(my_data,unique, by = "col1")
# or inner_join(unique, my_data, by = "col1")
# col1 id
#1 abc 1
#2 eee 4
#3 eee 5
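One difference worth knowing before swapping %in% for a join: merge() sorts the result on the key column by default (sort = TRUE) and renumbers the rows, so the output order can differ from the original data. The sketch below uses small made-up data (not the data from the question) and a hypothetical lookup table to show this.

```r
# Hypothetical data, deliberately not sorted by col1
my_data <- data.frame(col1 = c("eee", "bcd", "abc"), id = 1:3)
lookup  <- data.frame(col1 = c("abc", "eee"))   # hypothetical lookup table

# Logical subsetting keeps the original row order and row names
by_in <- my_data[my_data$col1 %in% lookup$col1, ]

# merge() sorts on the `by` column by default and resets the row names
by_merge <- merge(my_data, lookup, by = "col1")

by_in$id      # ids in original order
by_merge$id   # ids in key-sorted order
```

With this input, by_in keeps "eee" before "abc", whereas by_merge puts "abc" first. If the original order matters, the %in% approach is the simpler choice.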