python doesn't get page content

Time:02-13

I have the Python BeautifulSoup script below (adapted to Python 3 from another script). It runs without errors, but nothing is printed in cmd.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urlopen(url).read()
    # parse as html structured document
    soup = BeautifulSoup(data, 'html.parser')
    # kill javascript content
    for s in soup.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = soup.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__=="__main__":
    main()

Here's my cmd output

Microsoft Windows [Version 10.0..]
(c) Microsoft Corporation. All rights reserved.

C:\Users\user\Desktop\urls>python urls.py

C:\Users\user\Desktop\urls>

Why doesn't it return the pages' contents?

CodePudding user response:

Nothing appears in cmd because the code never calls print. main() builds the list txt and then discards it when the function ends. To print all the text parsed from the given URLs, call print() inside main():

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]
    for t in txt:
        print(t)
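
The difference between building a value and displaying it can be sketched without any network access. This is only an illustration: collect_texts, show_texts, and fake_fetch are hypothetical names that do not appear in the original script, and fake_fetch stands in for getPageText so the example runs offline.

```python
def collect_texts(urls, fetch):
    # Build the list of page texts; nothing is shown in the console here.
    return [fetch(url) for url in urls]


def show_texts(urls, fetch):
    # Calling print() is what makes the text appear in cmd.
    texts = collect_texts(urls, fetch)
    for url, text in zip(urls, texts):
        print(f"--- {url} ---")
        print(text)
    # Returning the list as well lets callers reuse it.
    return texts


def fake_fetch(url):
    # Hypothetical stand-in for getPageText (assumption: any callable
    # that maps a URL to a string can be used here).
    return f"text of {url}"


if __name__ == "__main__":
    show_texts(["http://example.com/a", "http://example.com/b"], fake_fetch)
```

Passing getPageText from the question in place of fake_fetch should print the real page text, assuming the requests succeed.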

