Python Web scraper for Stackoverflow

Time:10-20

I'm a newbie Python programmer learning how to design web scrapers. The tutorial I'm following used the code I've posted below and I'm trying to follow it. However, the code runs without displaying any info, plus I get two problem messages in VSCode telling me that:

  1. Missing timeout argument for method 'requests.get' can cause your program to hang indefinitely pylint(missing-timeout) [Ln 4, Col 12]

  2. Missing module docstring pylint(missing-module-docstring) [Ln 1, Col 1]

import requests
from bs4 import BeautifulSoup

response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")

questions = soup.select(".s-post-summary    js-post-summary")
for question in questions:
    print(question.select_one(".s-link").getText())

CodePudding user response:

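Your code prints nothing because whitespace inside a CSS selector is a descendant combinator: '.s-post-summary    js-post-summary' looks for <js-post-summary> tags nested inside elements with the class s-post-summary, and no such tags exist on the page. The correct way to select elements that have both classes is to chain them with no space in between: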

questions = soup.select('.s-post-summary.js-post-summary')

which will select any element that has both of these classes (possibly along with others), or

questions = soup.select('*[class="s-post-summary js-post-summary"]')

which will select only elements whose class attribute is exactly these two classes, with no others, arranged in that order.
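
To see the difference between these selectors without hitting the site, here is a small sketch that runs them against a made-up HTML snippet. The markup, ids and link text are invented for illustration, and the exact-match form is assumed to be an attribute selector on class; the real page may look different.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration; not copied from the real page.
html = """
<div class="s-post-summary js-post-summary" id="question-summary-1">
  <a class="s-link" href="/questions/1/first">First</a>
</div>
<div class="js-post-summary s-post-summary extra" id="question-summary-2">
  <a class="s-link" href="/questions/2/second">Second</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Whitespace is a descendant combinator, so this looks for
# <js-post-summary> tags inside .s-post-summary elements.
descendant = soup.select(".s-post-summary    js-post-summary")

# Chained classes: matches elements that have both, in any order,
# even if they carry additional classes.
both_classes = soup.select(".s-post-summary.js-post-summary")

# Attribute selector: matches only when the class attribute is
# exactly this string.
exact_attr = soup.select('*[class="s-post-summary js-post-summary"]')

print(len(descendant), len(both_classes), len(exact_attr))  # 0 2 1
```

The first count is 0, which is the same reason the code in the question printed nothing.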


However, as pointed out in another answer, the questions also have a data-post-type-id attribute (with the value 1), and that is likely a more reliable identifier than the class names. You can select on that attribute with select as well:

questions = soup.select('*[data-post-type-id="1"]')

You can actually even target the id attribute by using

questions = soup.select('*[id^="question-summary-"]')

to select all elements with id that starts with question-summary-


...and since I've already gone this far, if you just want the question titles, you can just directly use a single select statement:

for qLink in soup.select('*[data-post-type-id="1"] a.s-link'):
    print(qLink.get_text())
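
If you want to try the attribute selectors offline first, this is a minimal sketch against an invented snippet. The ids, titles and the extra non-question div are assumptions made up for the example, so treat it as a demonstration of the selector syntax rather than of the real page layout.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page's structure may differ or change.
html = """
<div id="question-summary-101" data-post-type-id="1">
  <h3><a class="s-link" href="/questions/101/a">How do I parse HTML?</a></h3>
</div>
<div id="question-summary-102" data-post-type-id="1">
  <h3><a class="s-link" href="/questions/102/b">Why is my list empty?</a></h3>
</div>
<div id="sidebar" data-post-type-id="2">
  <a class="s-link" href="/help">Help</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on the attribute value.
by_type = soup.select('*[data-post-type-id="1"]')

# Match on an id prefix (^= means "starts with").
by_id = soup.select('*[id^="question-summary-"]')

# Go straight to the title links inside each question block.
titles = [
    link.get_text(strip=True)
    for link in soup.select('*[data-post-type-id="1"] a.s-link')
]
print(titles)
```

Both by_type and by_id pick out the same two question blocks here, and the sidebar link is left out of titles because its parent has a different data-post-type-id.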


As for the two warning messages, I really don't think you need to worry much about them, and you can just ignore them. If they're bothering you, take a look at this thread, which mentions some ways to either appease or disable the second warning. The first warning can be disabled in a similar way, or you can simply pass a timeout argument to your request, like

response = requests.get("https://stackoverflow.com/questions", timeout=5)

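which will raise a requests.exceptions.Timeout if the server takes longer than 5 seconds to accept the connection or to send data. Note that this is not a limit on the total download time: the exception is raised when no bytes have been received for that many seconds.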
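
If you would rather handle the timeout than let the script crash, one possible approach is to wrap the request in a small helper. This is only a sketch: fetch_html is a hypothetical name I made up, and returning None on a timeout is just one way to signal the failure.

```python
import requests


def fetch_html(url, timeout=5):
    """Return the page body as text, or None if the request times out.

    fetch_html is a hypothetical helper for illustration only.
    """
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect and read timeouts.
        print(f"No response from {url} within {timeout} seconds")
        return None
    # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

You would then call it as html = fetch_html("https://stackoverflow.com/questions") and only build the BeautifulSoup object when the result is not None.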

CodePudding user response:

You can find each question block using its data-post-type-id attribute:

# Each question block is a div with data-post-type-id="1"
questions = soup.find_all('div', attrs={'data-post-type-id': '1'})
for question in questions:
    print(question.find('a', attrs={'class': 's-link'}).get('href'))

Results will be:

/questions/74129840/how-to-keep-audio-driver-busy-with-no-sound-output-using-powershell-or-command-p
/questions/74129837/vuejs-how-to-set-value-of-fecolormatrix
/questions/74129835/please-show-me-example-of-software-architecture
/questions/74129833/problem-getting-global-styles-includes-in-storybook-with-nx-and-angular-12
/questions/74129831/not-able-to-figure-out-common-expression-language-conditional-statement
/questions/74129830/why-can-i-successfully-assign-a-ref-to-a-string
....
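
The hrefs above are relative paths. If you need full URLs, one option is urljoin from the standard library. The sketch below uses invented markup and made-up question ids, assumes https://stackoverflow.com as the base URL, and also skips blocks where find returns None instead of failing on them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://stackoverflow.com"  # assumed base for the relative hrefs

# Invented sample markup with made-up question ids.
html = """
<div data-post-type-id="1"><a class="s-link" href="/questions/111/sample-question-one">One</a></div>
<div data-post-type-id="1"><span>no link here</span></div>
<div data-post-type-id="1"><a class="s-link" href="/questions/222/sample-question-two">Two</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for question in soup.find_all("div", attrs={"data-post-type-id": "1"}):
    link = question.find("a", attrs={"class": "s-link"})
    if link is None:
        # find() returns None when nothing matches; skip such blocks.
        continue
    links.append(urljoin(BASE_URL, link.get("href")))

print(links)
```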