Scraping Google with rvest (2022 layout update)

A few weeks ago, people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code, as provided in:

How to retrieve hyperlinks in google search using rvest

How to retrieve text below titles from google search using rvest

But it no longer works and gives the following results:

library(rvest)
library(tidyverse)

#Part 1

url <- 'https://www.google.com/search?q=Mario Torres Mexico'

title <- "//div/div/div/a/h3"
text  <- paste0(title, "/parent::a/parent::div/following-sibling::div")

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())

Result:

# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>

And the second part:

#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")

titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")

Result:

character(0)

A few weeks ago, this worked. Is it possible to fix this issue?

CodePudding user response：

It looks like Google changed its HTML layout; perhaps there were too many of us scrapers.

Here you go:

library(rvest)
library(tidyverse)

#Part 1

url <- 'https://www.google.com/search?q=Mario Torres Mexico'

title <- "//div/div/div/a/div/div/h3/div"
text  <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
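One caveat with building the tibble from two independent node sets: if any result has no snippet, the two vectors differ in length and tibble() raises an error. A minimal sketch of one way around this is to select each title node first and then look up its snippet relative to that node, so a missing snippet becomes NA. The HTML below is a made-up stand-in that only mimics the nesting implied by the title XPath above (it is not Google's actual markup), and the relative snippet path is simplified to match it.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in page: two results, the second one has no snippet.
mock_html <- '
<html><body>
  <div><div>
    <div><a href="/url?q=https://example.com/a&amp;sa=U"><div><div><h3><div>First title</div></h3></div></div></a></div>
    <div><div>First snippet</div></div>
  </div></div>
  <div><div>
    <div><a href="/url?q=https://example.com/b&amp;sa=U"><div><div><h3><div>Second title</div></h3></div></div></a></div>
  </div></div>
</body></html>'

page <- read_html(mock_html)

# Same title XPath as in the answer.
title_nodes <- html_elements(page, xpath = "//div/div/div/a/div/div/h3/div")

# html_element() (singular) returns exactly one entry per title node,
# using a missing node when nothing matches, so the columns stay aligned.
results <- tibble(
  title = html_text(title_nodes),
  text = title_nodes %>%
    html_element(xpath = "./ancestor::a/parent::div/following-sibling::div[1]/div[1]") %>%
    html_text()
)

results
```

On a real results page, the relative XPath would have to be adapted to whatever the current layout is; the point of the sketch is only the one-row-per-title pattern.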

And part 2:

titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")

titles %>%
  html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
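The regular expression above stops at the first `&`, which works when the link is wrapped as `/url?q=<target>&sa=...`. As a slightly more defensive variant, here is a small sketch that pulls out the `q` parameter and percent-decodes it, since a wrapped target may contain encoded characters. The helper name `clean_google_href` and the sample hrefs are made up for illustration, and the wrapper format is an assumption about how the links appear in the page.

```r
library(stringr)

# Hypothetical helper: extract and decode the target URL from a
# redirect-style href such as "/url?q=<target>&sa=U&ved=...".
clean_google_href <- function(href) {
  # Capture everything after "q=" up to the next "&" (or end of string).
  target <- str_match(href, "[?&]q=([^&]+)")[, 2]
  # Percent-decode each value; keep NA for hrefs without a "q" parameter.
  vapply(
    target,
    function(x) if (is.na(x)) NA_character_ else utils::URLdecode(x),
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up sample hrefs for illustration only.
hrefs <- c(
  "/url?q=https://example.com/page&sa=U&ved=abc",
  "/url?q=https://example.com/search%3Fx%3D1&sa=U",
  "/search?tbm=isch"
)

clean_google_href(hrefs)
```

The third sample has no `q` parameter, so it comes back as NA instead of being silently dropped, which keeps the result the same length as the input.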