How to scrape data with filters from the website when the URL doesn't change?

Time:03-25

I've scraped the data from this word list in R, but the result doesn't reflect the website filters I had applied (List = Oxford 3000 and CEFR level = A1), and as far as I can see there are no variables I can use to filter the data in R.

Is there some other way I can get just the data I want? The URL doesn't appear to change with filtering.

Here is my code:

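library(rvest)  # read_html(), html_nodes(), html_text()
library(purrr)  # map()

url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000"

# Download the page and collect the text of every matching node
url %>%
  map(. %>%
    read_html() %>%
      html_nodes(".belong-to , .pos , a") %>%
      html_text()
  ) %>%
  unlist() -> ox3ka1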

CodePudding user response:

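The URL doesn't change because the full list appears to be in the page's HTML already, with each entry's level stored in a data attribute. To select only the words with the A1 filter, keep the li elements whose data-ox5000 attribute equals 'a1':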

library(rvest)
# Keep only the <li> entries whose data-ox5000 attribute is "a1"
df <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000" %>%
  read_html() %>%
  html_nodes(".top-g") %>%
  html_nodes("li[data-ox5000 = 'a1']") %>%
  html_text()

head(df)
[1] "   a   indefinite articlea1      " "   about   adverba1      "         "   about   prepositiona1      "    "   above   adverba1      "        
[5] "   above   prepositiona1      "    "   across   adverba1      "   
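
As a follow-up, here is a minimal sketch of how the filtered entries could be turned into a data frame rather than one padded string per word. It runs against a small hand-written HTML fragment, not the live page, so the markup is an assumption: the data-hw and data-ox3000 attributes and the .pos span are modeled on the selectors used in the question and answer above and may not match the real page exactly. Since the question asks for the Oxford 3000 list, the sketch filters on data-ox3000 instead of data-ox5000; check the page source to confirm that attribute exists.

```r
library(rvest)

# Hypothetical fragment imitating the word-list markup (assumed structure;
# the attribute names data-hw and data-ox3000 are not confirmed by the source).
sample_html <- '<ul class="top-g">
  <li data-hw="a" data-ox3000="a1" data-ox5000="a1"><a href="#">a</a> <span class="pos">indefinite article</span></li>
  <li data-hw="abandon" data-ox3000="b2" data-ox5000="b2"><a href="#">abandon</a> <span class="pos">verb</span></li>
  <li data-hw="absorb" data-ox5000="b2"><a href="#">absorb</a> <span class="pos">verb</span></li>
  <li data-hw="about" data-ox3000="a1" data-ox5000="a1"><a href="#">about</a> <span class="pos">adverb</span></li>
</ul>'

page <- read_html(sample_html)

# Keep only entries tagged A1 in the (assumed) Oxford 3000 attribute
items <- html_nodes(page, ".top-g li[data-ox3000='a1']")

# One row per entry: headword, part of speech, CEFR level
ox3k_a1 <- data.frame(
  word  = html_attr(items, "data-hw"),
  pos   = html_text(html_node(items, ".pos"), trim = TRUE),
  level = html_attr(items, "data-ox3000"),
  stringsAsFactors = FALSE
)

print(ox3k_a1)
```

If those attribute names hold on the real page, passing the URL to read_html() in place of sample_html should give the same columns for the full list.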

For further reference, see: How do I use html_nodes to select nodes with "attribute = x" in R?
