I've got the Python BeautifulSoup script below (adapted to Python 3 from that script). It runs without errors, but nothing is printed in cmd.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
Newlines = re.compile(r'[\r\n]\s+')
def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in soup.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = soup.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()
Here's my cmd output:
Microsoft Windows [Version 10.0..]
(c) Microsoft Corporation. All rights reserved.
C:\Users\user\Desktop\urls>python urls.py
C:\Users\user\Desktop\urls>
Why doesn't it print the page contents?
Answer:
Nothing appears in cmd because the code never calls print(). getPageText() does return the text, but main() only stores it in the list txt and then exits, so the result is discarded.
If you want to print all the text parsed from the given URLs, call print() inside main():
def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    # the variable is named urls (lowercase); URLs would raise a NameError
    txt = [getPageText(url) for url in urls]
    # write each page's text to the console
    for t in txt:
        print(t)
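
To see the difference between returning a value and printing it without touching the network, here is a small sketch that uses only the standard library. The names clean_text, main_silent, main_printing, captured_output and SAMPLE_PAGES are made up for this illustration and are not part of the original script; clean_text is only a stand-in for getPageText() that applies the whitespace clean-up to a plain string, and the pattern assumes the quantifier in the question was meant to be \s+.

```python
import contextlib
import io
import re

# Same idea as Newlines in the question; the \s+ quantifier is an assumption
NEWLINES = re.compile(r'[\r\n]\s+')


def clean_text(raw):
    # Hypothetical stand-in for getPageText(): no download, no parsing,
    # it only collapses a line break plus the whitespace that follows it.
    return NEWLINES.sub('\n', raw)


def main_silent(pages):
    # Builds the list and then discards it, so nothing reaches the console
    txt = [clean_text(p) for p in pages]


def main_printing(pages):
    # Builds the same list, then writes each entry to stdout
    txt = [clean_text(p) for p in pages]
    for t in txt:
        print(t)


def captured_output(func, pages):
    # Run func(pages) and return everything it wrote to stdout
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(pages)
    return buffer.getvalue()


# Made-up sample strings standing in for downloaded page text
SAMPLE_PAGES = ['first line\r\n\r\n   second line', 'only line']

if __name__ == '__main__':
    print(repr(captured_output(main_silent, SAMPLE_PAGES)))
    print(repr(captured_output(main_printing, SAMPLE_PAGES)))
```

The first call produces an empty string, which matches the empty cmd output in the question; the second produces the cleaned text because of the print() loop.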