I've scraped the data from this list in R, but the result doesn't reflect the website filters I had applied (List = Oxford 3000 and CEFR level = A1), and as far as I can see the scraped data contains no variables I could use to filter it in R.
Is there some other way to get just the data I want? The URL doesn't appear to change when I apply the filters.
Here is my code:
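library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%
library(purrr)  # map()

url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000"
url %>%
  map(. %>%
    read_html() %>%
    html_nodes(".belong-to , .pos , a") %>%
    html_text()
  ) %>%
  unlist() -> ox3ka1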
CodePudding user response:
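The URL stays the same because the filtering appears to happen in the browser: the downloaded HTML always contains the full list, and each list item carries its level in a data-ox5000 attribute.
To select only the words with level a1, filter the list items on that attribute: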
library(rvest)
# Keep only the li items inside .top-g whose data-ox5000 attribute is "a1"
df <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000" %>%
  read_html() %>%
  html_nodes(".top-g") %>%
  html_nodes("li[data-ox5000 = 'a1']") %>%
  html_text()
head(df)
[1] " a indefinite articlea1 " " about adverba1 " " about prepositiona1 " " above adverba1 "
[5] " above prepositiona1 " " across adverba1 "
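The strings above run the word, its part of speech, and its level together. As a rough sketch of how they could be split into columns, the example below applies the same attribute selector to a small inline HTML snippet. The snippet is hypothetical: it only imitates the structure suggested by the selectors used in this thread (an a tag for the word, .pos for the part of speech, .belong-to for the level), so check the real page source before relying on it. The name a1_words is also just illustrative.

```r
library(rvest)

# Hypothetical markup imitating the word list (assumed structure, not copied from the site)
html <- '
<ul class="top-g">
  <li data-ox5000="a1"><a>a</a> <span class="pos">indefinite article</span> <span class="belong-to">a1</span></li>
  <li data-ox5000="b2"><a>abandon</a> <span class="pos">verb</span> <span class="belong-to">b2</span></li>
  <li data-ox5000="a1"><a>about</a> <span class="pos">adverb</span> <span class="belong-to">a1</span></li>
</ul>'

page <- read_html(html)

# Keep only the list items whose data-ox5000 attribute equals "a1"
items <- page %>%
  html_nodes(".top-g") %>%
  html_nodes("li[data-ox5000='a1']")

# html_node() returns exactly one result per item, so the columns stay aligned
a1_words <- data.frame(
  word  = items %>% html_node("a") %>% html_text(trim = TRUE),
  pos   = items %>% html_node(".pos") %>% html_text(trim = TRUE),
  level = items %>% html_attr("data-ox5000"),
  stringsAsFactors = FALSE
)
a1_words
```

To run this against the live page, replace the html string with the URL in read_html(). If the list items also carry a separate attribute for the Oxford 3000 list (an assumption worth confirming in the page source), the same selector pattern should apply to it.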
For further reference, see: How do I use html_nodes to select nodes with "attribute = x" in R?