Web scraping in Python with BeautifulSoup

Time:07-08

I would like to scrape a website in Python. The class_ value is too long for PyCharm's line limit (more than 120 characters), so I split it across two variables. However, it still doesn't work: soup.find only returns None. What am I doing wrong?

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.foxestalk.co.uk/topic/127651-youri-tielemans/page/122/#comments").text
soup = BeautifulSoup(html, "lxml")

test = "cPost ipsBox ipsResponsive_pull  ipsComment  ipsComment_parent ipsClearfix "
test2 = test + "ipsClear ipsColumns ipsColumns_noSpacing ipsColumns_collapsePhone    "

comment = soup.find("article", class_=test2)

CodePudding user response:

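You need to fix the spacing between the class names. Use exactly one space between them and no trailing whitespace, because BeautifulSoup compares a multi-class string against the element's classes joined by single spaces: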

html = requests.get("https://www.foxestalk.co.uk/topic/127651-youri-tielemans/page/122/#comments").text
soup = BeautifulSoup(html, "lxml")

# One space between class names; the trailing space in `test` separates the two halves
test = "cPost ipsBox ipsResponsive_pull ipsComment ipsComment_parent ipsClearfix "
test2 = test + "ipsClear ipsColumns ipsColumns_noSpacing ipsColumns_collapsePhone"

# find_all returns a list of every matching <article>
comment = soup.find_all("article", {'class': test2})
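
To see why the spacing matters, here is a minimal offline sketch. It assumes a small hypothetical HTML snippet (not the real forum page) and uses the built-in html.parser so it runs without lxml. BeautifulSoup splits the class attribute on whitespace, and a multi-class search string only matches when it equals those classes joined by single spaces, so collapsing the whitespace with split() and " ".join() should be enough to make the lookup succeed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one forum comment; note the messy
# whitespace inside the class attribute.
html = '<article class="cPost ipsBox  ipsComment  ipsClearfix ">hello</article>'
soup = BeautifulSoup(html, "html.parser")

# The search string as the asker built it: doubled and trailing spaces.
raw = "cPost ipsBox  ipsComment  " + "ipsClearfix "

# Collapse every run of whitespace to one space and strip the ends.
normalized = " ".join(raw.split())

not_found = soup.find("article", class_=raw)
found = soup.find("article", class_=normalized)

print(not_found)         # None
print(found.get_text())  # hello
```

Normalizing the string this way also removes the need to get the spacing right by hand when the class list is split across several variables.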

CodePudding user response:

As mentioned by @1extraline, you should fix the spaces / typos in your class string to reach your goal.

I would also recommend not selecting elements by their full class list: class names are often generated dynamically, and you rarely need all of them to identify an element.

So change your strategy and select by more static attributes such as id, or by structure such as the tag name.

In your specific case, simply use CSS selectors to shorten your selection:

soup.select('article')

or a bit more specific:

soup.select('article[id^="elComment"]')
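
These selectors can be tried offline first. The sketch below assumes a small hypothetical HTML fragment (the ids and class names mimic the forum's markup but are made up for illustration). It shows that a CSS class selector needs only a subset of the classes, in any order, and that [id^="elComment"] keeps only the elements whose id starts with that prefix.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two comments and one unrelated <article>.
html = """
<article id="elComment_1" class="cPost ipsBox  ipsComment ">first</article>
<article id="elComment_2" class="ipsComment cPost">second</article>
<article id="sidebar" class="ipsBox">not a comment</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <article>, regardless of id or class.
all_articles = soup.select("article")

# Only a subset of the classes is required; order and spacing do not matter.
by_classes = soup.select("article.ipsComment.cPost")

# Attribute prefix match on the id.
by_id_prefix = soup.select('article[id^="elComment"]')

print(len(all_articles))                     # 3
print([a["id"] for a in by_classes])         # ['elComment_1', 'elComment_2']
print([a.get_text() for a in by_id_prefix])  # ['first', 'second']
```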

Example

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.foxestalk.co.uk/topic/127651-youri-tielemans/page/122/#comments").text
soup = BeautifulSoup(html, "lxml")

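# Number of <article> elements on the page
print(len(soup.select('article')))

data = []

# Each comment is an <article> whose id starts with "elComment"
for e in soup.select('article[id^="elComment"]'):
    data.append({
        'username': e.h3.text.strip(),
        'postCount': e.select_one('ul ul li').text.strip(),
        'whatever': 'you like to scrape'
    })

# In a notebook this displays the list; in a script use print(data)
data

Output
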
[{'username': 'foxfanazer',
  'postCount': '31,325',
  'whatever': 'you like to scrape'},
 {'username': 'CrispinLA in Texas',
  'postCount': '2,197',
  'whatever': 'you like to scrape'},
 {'username': 'happy85',
  'postCount': '1,077',
  'whatever': 'you like to scrape'},
 {'username': 'CrispinLA in Texas',
  'postCount': '2,197',
  'whatever': 'you like to scrape'},
 {'username': "Sharpe's Fox",
  'postCount': '7,867',
  'whatever': 'you like to scrape'},
 {'username': 'foxfanazer',
  'postCount': '31,325',
  'whatever': 'you like to scrape'},
 {'username': "Sharpe's Fox",
  'postCount': '7,867',
  'whatever': 'you like to scrape'},
 {'username': 'cropstonfox',
  'postCount': '825',
  'whatever': 'you like to scrape'},...]
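
The loop above assumes every article contains an h3 and a nested list item. If one is missing, e.h3 or select_one(...) returns None and .text raises AttributeError. A more defensive variant might look like the sketch below. The helper names text_or_none and to_int are hypothetical, and the HTML is a made-up stand-in for the forum markup; it also converts the post count to an integer, which is optional.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second comment has no <h3> username.
html = """
<article id="elComment_1"><h3> foxfanazer </h3><ul><li><ul><li>31,325</li></ul></li></ul></article>
<article id="elComment_2"><ul><li><ul><li>825</li></ul></li></ul></article>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def to_int(text):
    """Turn a string such as '31,325' into 31325; pass None through."""
    return int(text.replace(",", "")) if text else None


data = []
for e in soup.select('article[id^="elComment"]'):
    data.append({
        "username": text_or_none(e.h3),
        "postCount": to_int(text_or_none(e.select_one("ul ul li"))),
    })

print(data)
# [{'username': 'foxfanazer', 'postCount': 31325}, {'username': None, 'postCount': 825}]
```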