A few weeks ago, people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code, as provided in:
How to retrieve hyperlinks in google search using rvest
How to retrieve text below titles from google search using rvest
It is no longer working and gives the following results:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario Torres Mexico'
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
Result:
# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>
And the second part:
#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
Result:
character(0)
A few weeks ago this worked. Is it possible to fix this issue?
CodePudding user response:
It looks like Google changed its HTML layout; perhaps there were too many of us scraping.
Here you go:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario Torres Mexico'
title <- "//div/div/div/a/div/div/h3/div"
text <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
And part 2:
titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")
titles %>%
  html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
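Since these paths will likely break again whenever the layout changes, it can help to test the axis steps on a small snippet first. Below is a minimal sketch, assuming a hand-written HTML string that only mimics the nesting targeted above (it is not Google's real markup, and example.com is a placeholder). It uses `ancestor::a[1]` in place of the chain of `parent::` steps, which is my own substitution and not part of the code above.

```r
library(rvest)
library(stringr)

# Hypothetical markup that imitates the nesting in the XPath above;
# this is NOT Google's actual HTML.
html <- '
<html><body>
  <div id="result">
    <div>
      <a href="/url?q=https://example.com/page&amp;sa=U">
        <div><div><h3><div>Example title</div></h3></div></div>
      </a>
    </div>
    <div><div>Example snippet text</div></div>
  </div>
</body></html>'

page <- read_html(html)

# Title node: a div inside an h3, nested two divs deep inside a link
title_xpath <- "//a/div/div/h3/div"
titles <- html_elements(page, xpath = title_xpath)

# Climb from each title to its nearest enclosing <a>, then read the href
links <- titles %>%
  html_elements(xpath = "./ancestor::a[1]") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")

# From the link, step up to its wrapping div and over to the next sibling div
snippet_xpath <- paste0(title_xpath,
                        "/ancestor::a[1]/parent::div/following-sibling::div[1]")
snippets <- page %>%
  html_elements(xpath = snippet_xpath) %>%
  html_text2()

result <- data.frame(title = html_text(titles),
                     link = links,
                     text = snippets)
print(result)
```

With `ancestor::a[1]` the expression does not depend on exactly how many wrappers sit between the link and the title, so it may survive small layout changes a bit longer than counting `parent::` steps.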