I have a list of movies whose genres I want to scrape from Google. I've built this code:
import requests
from bs4 import BeautifulSoup
list=['Psychological thriller','Mystery','Crime film','Neo-noir','Drama','Crime Thriller','Indie film']
gen2 = {}
for i in list:
    user_query = i + 'movie genre'
    URL = 'https://www.google.co.in/search?q=' + user_query
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    c = soup.find(class_='EDblX DAVP1')
    print(c)
    if c != None:
        genres = c.findAll('a')
        gen2[i]= genres
But it returns an empty dict, so I checked one by one and it worked, for example:
user_query = 'Se7en movie genre'
URL = "https://www.google.co.in/search?q=" + user_query
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
v = soup.find(class_='KKHQ8c')
h = []
genres = v.findAll('a')
for genre in genres:
    h.append(genre.get_text())
So I found out that in the for loop the variable c is None. I can't figure out why! It only returns None inside the loop.
CodePudding user response:
Maybe the domain www.google.co.in blocks your IP because you send requests continuously; it can return a valid HTML page that lacks the content you want. Consider adding a waiting time between your requests and passing proxies in your requests call:
proxies = {
    "http" : "http://10.10.1.10:3128",
    "https" : "https://10.10.1.11:1080"
}
r = requests.get(URL, headers=headers, proxies=proxies)
PS. I can't comment yet, so answered directly.
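The waiting time suggested above could be sketched roughly as follows. This is only an illustration: `fetch_with_delay`, its `fetch` callback and the 2-second default are hypothetical names and values, not part of the original code, and the delay that actually avoids blocking is unknown.

```python
import time


def fetch_with_delay(queries, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(query) for each query, pausing between consecutive calls.

    `fetch` is any callable that takes a query string (hypothetical helper,
    for example a small wrapper around requests.get).
    """
    results = {}
    for index, query in enumerate(queries):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[query] = fetch(query)
    return results
```

In the loop from the question, `fetch` could be a small function that calls `requests.get(URL, headers=headers, proxies=proxies)` and returns the page, so the scraping logic itself stays unchanged.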
CodePudding user response:
You get None when your search doesn't return the special movie header. You can fix this by adding a space before 'movie genre'.
Try this code (indentation fixed too):
list=['Psychological thriller', 'Mystery', 'Crime film', 'Neo-noir', 'Drama', 'Crime Thriller', 'Indie film']
gen2 = {}
for i in list:
    # added one space before 'movie genre'
    user_query = i + ' movie genre'
    URL = 'https://www.google.co.in/search?q=' + user_query
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    c = soup.find(class_='EDblX DAVP1')
    if c is None:
        print("None")
    else:
        genres = c.findAll('a')
        gen2[i]= genres
Now you get proper results for all elements of your list except Mystery. Mystery is the only search term that doesn't produce the movie header. You might want to change it to a different term.
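To make the missing-space problem harder to reintroduce, the query could be assembled and URL-encoded in one place. This is a minimal sketch using only the standard library; `build_search_url` and `SEARCH_URL` are names made up for this example, and it only shows how the URL is built, not whether Google returns the movie header for it.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www.google.co.in/search'


def build_search_url(term, suffix='movie genre'):
    """Join the term and the suffix with exactly one space, then URL-encode."""
    query = term.strip() + ' ' + suffix
    # urlencode turns the space into '+', so the words stay separated.
    query_string = urlencode({'q': query})
    return SEARCH_URL + '?' + query_string


print(build_search_url('Drama'))
# https://www.google.co.in/search?q=Drama+movie+genre
```

With plain concatenation and no space, the query for Drama would have been 'Dramamovie genre', which is a different search.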