Web scraping with BS4 returning None

Time:06-22

I have a list of movies and I want to scrape their genres from Google. I've written this code:

import requests
from bs4 import BeautifulSoup

list=['Psychological thriller','Mystery','Crime film','Neo-noir','Drama','Crime Thriller','Indie film']
gen2 = {}
for i in list:
  user_query = i + 'movie genre'
  URL = 'https://www.google.co.in/search?q=' + user_query
  headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
  page = requests.get(URL, headers=headers)
  soup = BeautifulSoup(page.content, 'html.parser')
  c = soup.find(class_='EDblX DAVP1')
  print(c)
  if c != None:
    genres = c.findAll('a')
    gen2[i]= genres

But it returns an empty dict, so I checked one by one and it worked, for example:

user_query = 'Se7en movie genre'
URL = "https://www.google.co.in/search?q=" + user_query
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
v = soup.find(class_='KKHQ8c')
h = []
genres = v.findAll('a')
for genre in genres:
  h.append(genre.get_text())

So I found out that inside the for loop the variable c is None. I can't figure out why! It is only None inside the loop.

CodePudding user response:

Maybe the domain www.google.co.in is blocking your IP because you send requests in quick succession; it can still return a valid HTML page, just without the content you want. Consider adding a waiting time between your requests and using proxies in your requests call.

# example proxy addresses; replace them with proxies you can actually use
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.11:1080",
}

r = requests.get(URL, headers=headers, proxies=proxies)
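The waiting time between requests could be sketched roughly as below. This is only an illustration: fetch_all, its fetch argument and the 2-second default are hypothetical names and values, not part of requests or of the code in the question, and a delay alone may not be enough to avoid being blocked.

```python
import time


def fetch_all(queries, fetch, delay_seconds=2.0):
    """Call fetch(query) for each query, pausing between consecutive calls.

    fetch is any callable supplied by the caller (hypothetical helper),
    for example a small wrapper around requests.get.
    """
    results = {}
    for index, query in enumerate(queries):
        if index > 0:
            # wait before every request except the first one
            time.sleep(delay_seconds)
        results[query] = fetch(query)
    return results
```

With the code from the question, fetch could be something like lambda q: requests.get('https://www.google.co.in/search?q=' + q, headers=headers, proxies=proxies).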

PS. I can't comment yet, so answered directly.

CodePudding user response:

You get None when your search doesn't return the special movie header. You can fix this by adding a space before 'movie genre'.

Try this code (indentation fixed too):

import requests
from bs4 import BeautifulSoup

list=['Psychological thriller', 'Mystery', 'Crime film', 'Neo-noir', 'Drama', 'Crime Thriller', 'Indie film']

gen2 = {}
for i in list:
    # added one space before 'movie genre'
    user_query = i + ' movie genre'
    URL = 'https://www.google.co.in/search?q=' + user_query
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    c = soup.find(class_='EDblX DAVP1')
    if c is None:
        print("None")
    else:
        genres = c.findAll('a')
        gen2[i]= genres

Now you get proper results for every element of your list except Mystery. Mystery is the only search term that doesn't produce the movie header, so you might want to change it to a different term.
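To make the missing space easier to spot, the search URL can be built and printed before it is requested. The following is a minimal sketch using only the standard library; build_search_url and BASE_URL are hypothetical names introduced for illustration, and urlencode is used here instead of plain string concatenation so that spaces are encoded explicitly.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.co.in/search"


def build_search_url(term, suffix="movie genre"):
    """Join term and suffix with a single space and URL-encode the query."""
    query = f"{term} {suffix}"
    return f"{BASE_URL}?{urlencode({'q': query})}"


print(build_search_url("Neo-noir"))
# prints https://www.google.co.in/search?q=Neo-noir+movie+genre
```

For comparison, 'Neo-noir' + 'movie genre' from the original loop gives the query 'Neo-noirmovie genre', which is why the movie header did not appear.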
