Received "Error in open.connection(x, "rb") : HTTP error 404." after running a for-loop in R

While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.

I feel like it has something to do with the first part of my for-loop, so I tried changing numbers from character to numeric, but that did not fix the problem. I also tried the advice here; however, it caused more problems.

Think you can spot where I went wrong?

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty data frame  ------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in length(numbers)){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()

gender <- url %>% html_nodes('.category-title__age-division  .category-title__label') %>% html_text()  

matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}

I also ran this, but it did not return a data frame of the scraped data. Instead, it printed the results on the screen:

map_df(get_links, function(i){
  url <- read_html(i)
  
matches <- data.frame(ageDivision <- url %>% 
  html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division  .category-title__label') %>% html_text() ) 

master1.tree <- rbind(master1.tree, matches)
})

CodePudding user response：

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty data frame  ------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in numbers){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))

  ageDivision <- url %>% 
    html_nodes('.category-title__age-division') %>% 
    html_text()

  gender <- url %>% 
    html_nodes('.category-title__age-division  .category-title__label') %>% 
    html_text()

  matches = data.frame('division' = ageDivision, 'gender' = gender)
  master1.tree <- rbind(master1.tree, matches)
}
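
The only functional change from the code in the question is the loop header: for (i in numbers) instead of for (i in length(numbers)). A minimal sketch of why that matters, using made-up category IDs as a stand-in for the scraped numbers vector (the real IDs are not shown in the thread): length(numbers) is a single value, so the original loop runs once with i equal to the count of links, and the URL built from that count most likely does not exist, which would explain the 404.

```r
# Hypothetical category IDs standing in for the scraped `numbers` vector
numbers <- c(2035001, 2035002, 2035003)

# Original loop header: length(numbers) is the single value 3,
# so the body runs exactly once with i = 3
seen_original <- c()
for (i in length(numbers)) {
  seen_original <- c(seen_original, i)
}
seen_original
#> [1] 3

# Corrected loop header: i takes each category ID in turn
seen_fixed <- c()
for (i in numbers) {
  seen_fixed <- c(seen_fixed, i)
}
seen_fixed
#> [1] 2035001 2035002 2035003
```

If you need the position rather than the value, seq_along(numbers) gives 1, 2, 3 and you would index with numbers[i] inside the loop.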

CodePudding user response：

Here is an alternative to your code. First, it's not necessary to extract the numbers; you can loop directly over the vector get_links. Second, I use purrr::map_df for the looping part, which is more concise than a for loop. To this end, I use a custom function that scrapes a single page. Note that the result of map_df is assigned to master1.tree; without the assignment, the result is only printed. Finally, I use trim = TRUE with html_text, which removes the leading and trailing white space:

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .)

scrape_page <- function(url) {
  html <- read_html(url)
  data.frame(
    division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
    gender = html %>% html_nodes('.category-title__age-division  .category-title__label') %>% html_text(trim = TRUE)
  )
}

master1.tree <- purrr::map_df(get_links[1:5], scrape_page)

master1.tree
#>   division gender
#> 1 Master 1   Male
#> 2 Master 1   Male
#> 3 Master 1   Male
#> 4 Master 1   Male
#> 5 Master 1   Male
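
If some of the links turn out to be dead, a single HTTP 404 will stop the whole run. One possible way around that, sketched here with base R's tryCatch, is to wrap the scraping function so a failing page returns NULL and is dropped afterwards. To keep the sketch runnable without network access, fake_scrape and the example.com URLs are hypothetical stand-ins; in practice you would wrap scrape_page and pass get_links.

```r
# Hypothetical stand-in for scrape_page(): raises an error for "missing"
# URLs, similar to how read_html() errors on an HTTP 404
fake_scrape <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  data.frame(url = url, division = "Master 1")
}

# Wrapper that returns NULL instead of stopping when a page fails
safe_scrape <- function(url) {
  tryCatch(fake_scrape(url), error = function(e) NULL)
}

# Made-up URLs; the middle one triggers the simulated 404
urls <- c(
  "https://example.com/categories/1",
  "https://example.com/categories/missing",
  "https://example.com/categories/2"
)

results <- lapply(urls, safe_scrape)

# Drop the failed pages (NULL entries), then stack the remaining data frames
succeeded <- Filter(Negate(is.null), results)
combined <- do.call(rbind, succeeded)
nrow(combined)
#> [1] 2
```

Silently skipping pages can hide real problems, so it may be worth logging the failing URL inside the error handler before returning NULL.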