Scraping Google with rvest (2022 layout update)

Time:12-10

A few weeks ago, people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code, as provided in:

How to retrieve hyperlinks in google search using rvest

How to retrieve text below titles from google search using rvest

It is no longer working and gives the following results:

library(rvest)
library(tidyverse)

#Part 1

url <- 'https://www.google.com/search?q=Mario Torres Mexico'

title <- "//div/div/div/a/h3"
text  <- paste0(title, "/parent::a/parent::div/following-sibling::div")

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())

Result:

# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>

And second part:

#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")

titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")

Result:

character(0)

A few weeks ago this worked. Is it possible to fix this issue?

CodePudding user response:

It looks like Google changed their HTML layout; perhaps there were too many of us scrapers.

Here you go:

library(rvest)
library(tidyverse)

#Part 1

url <- 'https://www.google.com/search?q=Mario Torres Mexico'

title <- "//div/div/div/a/div/div/h3/div"
text  <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
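
A side note on the URL used above: the query contains literal spaces. The following is a minimal sketch of building the same URL with percent-encoding via base R's `URLencode()`. The names `query` and `search_url` are illustrative and not part of the original code, and whether the unencoded form is accepted may depend on the HTTP backend, so treat this as a precaution rather than a required fix.

```r
# Illustrative names; not part of the original code.
query <- "Mario Torres Mexico"

# reserved = TRUE also escapes characters such as & and =,
# so the search terms cannot break the URL structure.
search_url <- paste0("https://www.google.com/search?q=",
                     utils::URLencode(query, reserved = TRUE))

print(search_url)
```

The resulting `search_url` could then be passed to `read_html()` in place of `url`.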

And part 2:

titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")

titles %>%
  html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
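
Since the layout may change again, a possibly less brittle variant is to anchor on the `h3` and walk up with the `ancestor` axis instead of spelling out every `div` level. The sketch below runs against a hand-written HTML fragment that only imitates the nesting implied by the XPath expressions above. It is not real Google markup, and the snippet path (`parent::div/following-sibling::div[1]`) is an assumption carried over from those expressions.

```r
library(rvest)

# Hypothetical fragment imitating the nesting implied by the XPath above;
# this is NOT real Google markup.
sample_html <- paste0(
  '<html><body><div><div>',
  '<div><a href="/url?q=https://example.com/page&amp;sa=U">',
  '<div><div><h3><div>Example title</div></h3></div></div></a></div>',
  '<div><div><div>Example snippet text</div></div></div>',
  '</div></div></body></html>'
)

page <- read_html(sample_html)

# Every h3 that sits anywhere inside a link, regardless of the divs in between.
headings <- html_elements(page, xpath = "//a//h3")

# html_element() (singular) returns exactly one result per heading,
# with NA when nothing matches, so the columns stay the same length.
link_nodes <- html_element(headings, xpath = "./ancestor::a[1]")
snippet_nodes <- html_element(
  link_nodes,
  xpath = "./parent::div/following-sibling::div[1]"  # assumed snippet location
)

hrefs <- html_attr(link_nodes, "href")

results <- data.frame(
  title = html_text(headings, trim = TRUE),
  # Keep the part from "https" up to the first "&";
  # an href without "https" is left unchanged.
  link  = sub("^.*?(https[^&]*).*$", "\\1", hrefs, perl = TRUE),
  text  = html_text(snippet_nodes, trim = TRUE),
  stringsAsFactors = FALSE
)

print(results)
```

Using `html_element()` per heading also avoids the length mismatch that can occur when titles and snippets are selected independently and one result has no snippet.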