I'm new to web scraping and I wanted to get just a piece of text from a Google page (basically the date of a soccer match), but the soup doesn't contain all the HTML (I'm guessing because of requests), so I can't find it. I know this may be because Google uses JavaScript and I should use Selenium with chromedriver, but the thing is that I need the code to be usable on another computer, so I can't really use that.
Here's the code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
a = "Newcastle"
url ="https://www.google.com/search?q=" a " next match"
response = requests.get(url)
soup = BeautifulSoup(response.text,"html.parser")
print(soup)
for div in soup.find_all('div'):
    print(div.get_text())
What I want to find is
"<span >17/12, 13:30</span>"
It has
"//*[@id="sports-app"]/div/div[3]/div[1]/div/div/div/div/div[1]/div/div[1]/div/span[2]"
as its XPath.
Is it even possible?
CodePudding user response:
Try to set a User-Agent header when requesting the page from Google:
import requests
from bs4 import BeautifulSoup
a = "Newcastle"
url = "https://www.google.com/search?q=" a " next match&hl=en"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
# the widget Google renders for the upcoming fixture
next_match = soup.select_one('[data-entityname="Match Header"]')
# drop visually hidden helper elements so they don't duplicate the visible text
for t in next_match.select('[aria-hidden="true"]'):
    t.extract()
text = next_match.get_text(strip=True, separator=" ")
print(text)
Prints:
Club Friendlies · Dec 17, 13:30 Newcastle VS Vallecano
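One caveat: the team name is interpolated straight into the URL, and Google may also serve a consent page or a layout without the sports widget, in which case select_one returns None and the code above crashes. A minimal sketch of the same approach that lets requests URL-encode the query and checks for a missing match header (the selector and headers are taken from the answer above, the rest is an assumption, not tested against every Google layout):
import requests
from bs4 import BeautifulSoup

a = "Newcastle"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0"
}
# let requests build and URL-encode the query string (handles spaces, accents, etc.)
params = {"q": a + " next match", "hl": "en"}
response = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

next_match = soup.select_one('[data-entityname="Match Header"]')
if next_match is None:
    # consent page, CAPTCHA, or a layout without the sports widget
    print("Match header not found")
else:
    for t in next_match.select('[aria-hidden="true"]'):
        t.extract()
    print(next_match.get_text(strip=True, separator=" "))
Since this only uses requests and BeautifulSoup, it should run on another machine without a browser or chromedriver installed.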