While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.
I suspect it has something to do with the first part of my for-loop, so I tried converting numbers
from character to numeric, but that did not fix the problem. I also tried the advice here; however, it caused more problems.
Think you can spot where I went wrong?
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty vector ----------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
for (i in length(numbers)){
url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()
gender <- url %>% html_nodes('.category-title__age-division .category-title__label') %>% html_text()
matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}
I also ran this, but it did not return a data frame with the scraped data. Instead, it printed the results on the screen:
map_df(get_links, function(i){
url <- read_html(i)
matches <- data.frame(ageDivision <- url %>%
html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division .category-title__label') %>% html_text() )
master1.tree <- rbind(master1.tree, matches)
})
CodePudding user response:
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
  html_nodes('.categories-grid__category a') %>%
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty data frame ------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
# Loop over the category IDs themselves. length(numbers) is a single number,
# so for (i in length(numbers)) ran only once, with i equal to the count of links.
for (i in numbers){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  ageDivision <- url %>%
    html_nodes('.category-title__age-division') %>%
    html_text()
  gender <- url %>%
    html_nodes('.category-title__age-division .category-title__label') %>%
    html_text()
  matches = data.frame('division' = ageDivision,'gender' = gender)
  master1.tree <- rbind(master1.tree, matches)
}
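As far as I can tell, the only change from the code in the question is the loop header. Here is a minimal sketch of the difference that runs without any network access. The three IDs are made up for illustration and are not real category numbers from the site.

```r
# Hypothetical category IDs standing in for the scraped `numbers` vector
numbers <- c(2139101, 2139102, 2139103)

# Form used in the question: length(numbers) is the single value 3,
# so the body runs exactly once, with i = 3
seen_original <- c()
for (i in length(numbers)) {
  seen_original <- c(seen_original, i)
}

# Corrected form: the body runs once per ID, with i set to each ID in turn
seen_fixed <- c()
for (i in numbers) {
  seen_fixed <- c(seen_fixed, i)
}
```

With the original header, the URL that gets requested ends in the number of links rather than in a category ID, which would plausibly explain the HTTP 404.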
CodePudding user response:
Here is an alternative to your code. First, it's not necessary to extract the numbers. You can loop directly over the vector get_links. Second, I use purrr::map_df for the looping part, which is more concise than a for loop. To this end, I use a custom function that scrapes one of your pages. Finally, I use trim = TRUE with html_text to remove the leading and trailing white space:
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
  html_nodes('.categories-grid__category a') %>%
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .)
scrape_page <- function(url) {
  html <- read_html(url)
  data.frame(
    division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
    gender = html %>% html_nodes('.category-title__age-division .category-title__label') %>% html_text(trim = TRUE)
  )
}
master1.tree <- purrr::map_df(get_links[1:5], scrape_page)
master1.tree
#>   division gender
#> 1 Master 1   Male
#> 2 Master 1   Male
#> 3 Master 1   Male
#> 4 Master 1   Male
#> 5 Master 1   Male
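As for why the map_df attempt in the question only printed its result: my reading is that the return value of map_df was never assigned, and the rbind inside the anonymous function only changed a local copy of master1.tree. Below is a small sketch of that behaviour without any scraping. fake_scrape and master are hypothetical stand-ins I made up for scrape_page and master1.tree.

```r
library(purrr)

# Hypothetical stand-in for scrape_page(): one row per input, no network needed
fake_scrape <- function(id) {
  data.frame(division = paste("Division", id), gender = "Male")
}

master <- data.frame()

# Pattern from the question: `<-` inside the function assigns to a local
# variable, so the `master` defined above is never modified
invisible(map_df(1:3, function(i) {
  master <- rbind(master, fake_scrape(i))
}))
rows_after_local_rbind <- nrow(master)  # master is still empty here

# Assigning the value returned by map_df keeps the combined data frame
result <- map_df(1:3, fake_scrape)
```

So the function passed to map_df should simply return a data frame, and the combined result should be assigned once on the outside. As far as I know, recent purrr versions mark map_df as superseded in favour of map() followed by list_rbind(), but map_df still works.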