I want to filter out specific rows from a data set I got from the Project Gutenberg R package (gutenbergr). I want to select only the rows that contain a given word, but every row contains more than one word, so an exact comparison inside filter() will not work.
For example:
The sentence is: "The Little Vanities of Mrs. Whittaker: A Novel".
I want to filter out all the rows that contain the word "novel", but I cannot find out how.
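library(dplyr)       # left_join(), filter(), %>%
library(gutenbergr)  # gutenberg_works(), gutenberg_metadata, gutenberg_download()
library(tidytext)    # unnest_tokens()
gutenberg_full_data <- left_join(gutenberg_works(language == "en"), gutenberg_metadata, by = "gutenberg_id")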
gutenberg_full_data <- left_join(gutenberg_full_data, gutenberg_subjects)
gutenberg_full_data <- subset(gutenberg_full_data, select = -c(rights.x, has_text.x, language.y, gutenberg_bookshelf.x, gutenberg_bookshelf.y, rights.y, has_text.y, gutenberg_author_id.y, title.y, author.y))
# Drop rows with a missing author; a logical index also works when no value is NA
gutenberg_full_data <- gutenberg_full_data[!is.na(gutenberg_full_data$author.x), ]
novels <- gutenberg_full_data %>% filter(subject == "Drama")
original_books <- gutenberg_download(novels, meta_fields = "title")
original_books
tidy_books <- original_books %>%
  unnest_tokens(word, text)
This is the code I used to get the data frame using the "gutenbergr" package.
CodePudding user response:
You can use grepl() from base R for this. grepl() returns TRUE if the pattern is found in the string and FALSE otherwise. The match is case sensitive by default.
text = "The Little Vanities of Mrs. Whittaker: A Novel"
word = "Novel"
> grepl(word, text)
[1] TRUE
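Since the question asks about "novel" in lower case while the title has "Novel", here is a minimal sketch of case-insensitive and whole-word matching with base R. The variable names and the title "Novelties of the Season" are made up for illustration.

```r
title_text <- "The Little Vanities of Mrs. Whittaker: A Novel"

# Default matching is case sensitive, so lower-case "novel" is not found
case_sensitive <- grepl("novel", title_text)

# ignore.case = TRUE matches "Novel", "novel", "NOVEL", ...
case_insensitive <- grepl("novel", title_text, ignore.case = TRUE)

# \\b marks a word boundary, so "Novelties" (hypothetical title) does not count as "novel"
whole_word <- grepl("\\bnovel\\b", "Novelties of the Season",
                    ignore.case = TRUE, perl = TRUE)

print(c(case_sensitive, case_insensitive, whole_word))
```

Whether you want the word-boundary version depends on whether partial matches such as "Novelties" should count.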
Building your original_books data frame requires large downloads, so I'm showing you an example of searching for "Plays" in the title.x column of your novels data frame.
> novels %>%
mutate(contains_play = grepl("Plays", title.x))
# A tibble: 54 × 8
   gutenberg_id title.x          author.x      gutenberg_autho… language.x subject_type subject contains_play
          <int> <chr>            <chr>                    <int> <chr>      <chr>        <chr>   <lgl>
 1         1308 A Florentine Tr… Wilde, Oscar               111 en         lcsh         Drama   FALSE
 2         2270 Shakespeare's F… Shakespeare,…               65 en         lcsh         Drama   FALSE
 3         2587 Life Is a Dream  Calderón de …              970 en         lcsh         Drama   FALSE
 4         4970 There Are Crime… Strindberg, …             1609 en         lcsh         Drama   FALSE
 5         5053 Plays by August… Strindberg, …             1609 en         lcsh         Drama   TRUE
 6         5618 Six Plays        Darwin, Flor…             1814 en         lcsh         Drama   TRUE
 7         6587 King Arthur's S… Dell, Floyd               2100 en         lcsh         Drama   TRUE
 8         6782 The Robbers      Schiller, Fr…              289 en         lcsh         Drama   FALSE
 9         6790 Demetrius: A Pl… Schiller, Fr…              289 en         lcsh         Drama   FALSE
10         6793 The Bride of Me… Schiller, Fr…              289 en         lcsh         Drama   FALSE
# … with 44 more rows
Note that grepl() allows the second argument to be a vector and returns one logical value per element. Thus, using rowwise() is not necessary. If it could only search within a single string, we would have to use rowwise().
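To make the vectorisation concrete, here is a small base R sketch that subsets rows directly with the logical vector from grepl(). The data frame `books` is a stand-in built from three rows of the output above, not the real `novels` object.

```r
# Stand-in data frame using three complete titles from the output above
books <- data.frame(
  gutenberg_id = c(5618, 6782, 2587),
  title.x = c("Six Plays", "The Robbers", "Life Is a Dream"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row, no rowwise() needed
has_plays <- grepl("Plays", books$title.x)

rows_with_plays <- books[has_plays, ]      # keep only the matching rows
rows_without_plays <- books[!has_plays, ]  # keep everything else

print(rows_with_plays)
```

With dplyr loaded, the same logical vector can go straight into filter(), for example novels %>% filter(grepl("Plays", title.x)).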
CodePudding user response:
You are probably looking for something like the following. It returns TRUE for every string that contains the keyword you put in.
stringr::str_detect(variable, "keyword")
Example that keeps only the rows containing the specific string:
library(stringr)
df <- df %>% filter(str_detect(column_that_contains_the_word, "the word"))
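In your case, (I assume) you want to drop the rows that contain the specific string and keep all the others. Use a single case-insensitive pattern here: a vector such as c("novel", "Novel", "NOVEL") would be paired with the rows element by element instead of being tried against every row.
library(dplyr)
library(stringr)
original_books <- original_books %>% filter(!str_detect(title, regex("novel", ignore_case = TRUE)))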
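As a quick check of how str_detect() treats a vector of patterns, here is a small sketch. It assumes the stringr package is installed, and the three titles are made up for illustration.

```r
library(stringr)

# Hypothetical titles, not taken from the Gutenberg data
titles <- c("A NOVEL idea", "A Novel", "Poems")

# A pattern vector is matched pairwise: titles[1] against "novel",
# titles[2] against "Novel", titles[3] against "NOVEL"
pairwise <- str_detect(titles, c("novel", "Novel", "NOVEL"))

# A single case-insensitive pattern is tried against every title
any_case <- str_detect(titles, regex("novel", ignore_case = TRUE))

print(pairwise)
print(any_case)
```

The pairwise version misses "A NOVEL idea" because that title is only compared with the lower-case pattern, which is why the single case-insensitive pattern is the safer choice inside filter().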
Let us know if it worked.