How do you properly search using BS4?

Time:01-13

I'm still learning Python and thought a good project would be to make an Instagram scraper. As a first step I tried to scrape Kylie Jenner's profile picture, using BS4 to search the page, but then I ran into an issue.

import requests
from bs4 import BeautifulSoup as bs

instagramUser = input('Input Instagram Username: ')
url = 'https://instagram.com/' + instagramUser
r = requests.get(url)

soup = bs(r.text, 'html.parser')


profile_image = soup.find('img', class_ = "_6q-tv")['src']

print(profile_image)

On the line where I declare profile_image, I get an error saying:

line 12, in <module>
    profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable

I'm not sure why it doesn't work. My guess is that I'm reading Instagram's HTML wrong and searching incorrectly. I wanted to ask people more experienced than me what I'm doing wrong. Any help would be appreciated :)

CodePudding user response:

Instagram's content is loaded via JavaScript, so scraping it like this won't work: the HTML that requests receives most likely doesn't contain the profile image tag at all. Instagram also has many ways of blocking scrapers, so you will have a tough time scraping it without automating a browser with something like Selenium.
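To illustrate the point, here is a minimal offline sketch. The HTML string is a made-up stand-in (not Instagram's real markup) for a page whose image tag is only created by a script in the browser. BeautifulSoup parses the markup but never runs the script, so the search comes back empty:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <img> would only exist after a browser ran the script.
STATIC_HTML = """
<html><body>
<div id="root"></div>
<script>
  var img = document.createElement('img');
  img.className = '_6q-tv';
  img.src = 'https://example.com/avatar.jpg';
  document.getElementById('root').appendChild(img);
</script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The script is present as text, but it is never executed,
# so there is no <img> element in the parsed tree.
image_tag = soup.find("img", class_="_6q-tv")
script_tag = soup.find("script")

print(image_tag)               # None
print(script_tag is not None)  # True
```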

You can see what happens when you navigate to a page by opening your browser's Developer Tools, going to Network > Fetch/XHR, and reloading the page. There you can see all the other content that is loaded. Sometimes an easily accessible backend API is visible that loads the data you want and can be scraped instead (sadly not the case with Instagram, which is heavily protected).

CodePudding user response:

You can dissect the contents of line 12 into two commands:

image_tag = soup.find('img', class_ = "_6q-tv")
profile_image = image_tag['src']

The error

line 12, in <module>
    profile_image = soup.find('img', class_ = "_6q-tv")['src']
TypeError: 'NoneType' object is not subscriptable

indicates that the result of the first command is None, Python's null value, which represents the absence of a value. None does not implement the subscript operator ([]), so it is not subscriptable.

The most likely reason is that soup.find didn't find any tag matching your search criteria and therefore returned None.
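One way to guard against this is to check the result of soup.find before subscripting it. The sketch below assumes a small helper, find_image_src, which is a hypothetical name and not part of BeautifulSoup, and it uses short made-up HTML snippets rather than a real Instagram page:

```python
from bs4 import BeautifulSoup


def find_image_src(html, css_class):
    """Return the src of the first matching <img>, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    image_tag = soup.find("img", class_=css_class)
    if image_tag is None:
        # soup.find found no match; subscripting here would raise TypeError.
        return None
    # Tag.get returns None instead of raising if the attribute is missing.
    return image_tag.get("src")


found = find_image_src('<img class="_6q-tv" src="pic.jpg">', "_6q-tv")
missing = find_image_src("<p>no image here</p>", "_6q-tv")

print(found)    # pic.jpg
print(missing)  # None
```

With a check like this, the script can print a readable message such as "no matching tag" instead of crashing with the TypeError.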

To debug this issue, I suggest writing the fetched source into a file and inspecting that file with a text editor of your choice (or inspecting it directly in an interactive Python console). That way, you see what your Python program 'sees'. If you use the developer tools in the browser instead, you see the state of the web page after a bunch of JavaScript has run, but BeautifulSoup is oblivious to JavaScript. It just parses the document exactly as requests fetched it from the server.
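A rough sketch of that debugging step is shown below. dump_html is a hypothetical helper name, and the sample string stands in for r.text from the question's script so that the example runs without a network request:

```python
import tempfile
from pathlib import Path


def dump_html(html_text, path):
    """Write the fetched markup to disk so it can be opened in a text editor."""
    path = Path(path)
    path.write_text(html_text, encoding="utf-8")
    return path


# Stand-in for r.text; in the real script you would pass r.text instead.
sample = "<html><body><div id='root'></div></body></html>"

saved = dump_html(sample, Path(tempfile.gettempdir()) / "instagram_page.html")
print(saved)

# A quick check for the class name you are searching for:
print("_6q-tv" in sample)  # False
```

If the class name never appears in the saved file, soup.find has nothing to match, no matter how the search is written.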

As bushcat69's answer suggests, it's probably hard to scrape content from Instagram, so you may be better off with a simpler website that doesn't use as much JavaScript or as many protective measures against web scraping.
