Web scraping CNN: function to get the text of an article raises an error in Python

Time:03-18

I want to get the text from specific articles (more than one), so here is the function for that:

def get_article():
    for url in get_href():
        options = webdriver.ChromeOptions()
        options.add_argument("--ignore-certificate-error")
        options.add_argument("--ignore-ssl-errors")
        service = Service(executable_path='chromedriver.exe')
        driver = webdriver.Chrome(service=service, options=options)

        driver.get(url)
        time.sleep(4)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.minimize_window()

        text1 = url.get('div.Paragraph__component > span')
        print(text1)

The error I get is:

Traceback (most recent call last):
  File "c:\Users\user\Desktop\Informatik\Praktik\Projekte\Python\stiil_working_on\news_automation\try.py", line 116, in <module>
    get_article()
  File "c:\Users\user\Desktop\Informatik\Praktik\Projekte\Python\stiil_working_on\news_automation\try.py", line 99, in get_article
    text1 = url.get('div.Paragraph__component > span')
AttributeError: 'str' object has no attribute 'get'

What I want to do in this function is use the URLs returned by get_href():

def get_href():
    all_results = []


    for h3 in soup.select('h3.cnn-search__result-headline > a'):
        title = h3.text
        url_ = h3.get('href')
        abs_url = 'https:' + url_

        all_results.append(abs_url)


    return all_results

and then open each one and scrape the article text from it, but it's not working and I don't know how to fix it. Does anyone know how to do it?

CodePudding user response:

You have to use

text1 = soup.select_one('div.Paragraph__component > span')  # select() returns a list; select_one() returns the first match
print(text1.text)
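
To illustrate the difference, here is a minimal sketch on a made-up HTML snippet (the markup is an assumption, not CNN's actual page). select() returns a list of all matches, while select_one() returns the first match or None, so it is probably worth checking for None before reading .text:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the selector from the question.
SAMPLE_HTML = """
<div class="Paragraph__component"><span>First paragraph.</span></div>
<div class="Paragraph__component"><span>Second paragraph.</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
selector = "div.Paragraph__component > span"

all_spans = soup.select(selector)       # list of every matching tag
first_span = soup.select_one(selector)  # first matching tag, or None

# Guard against None so a missing element does not raise AttributeError.
first_text = first_span.text if first_span is not None else ""

# A selector that matches nothing gives None, not an empty tag.
missing = soup.select_one("div.does-not-exist > span")

print(len(all_spans))
print(first_text)
print(missing)
```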

CodePudding user response:

text1 = url.get('div.Paragraph__component > span')

would be the following in Selenium:

text1 = driver.find_element(By.CSS_SELECTOR, "div.Paragraph__component > span").text

or in BeautifulSoup:

text1 = soup.select_one('div.Paragraph__component > span')
print(text1.text)

Import:

from selenium.webdriver.common.by import By

CodePudding user response:

The problem is that url is a string, and a string does not have the get() method you are trying to call, hence the error. You need to run the selector on the parsed HTML, which you have in your soup variable, using select_one(). So it would look like this:

text1 = soup.select_one('div.Paragraph__component > span')
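
One caveat worth checking: in BeautifulSoup, get() on a tag looks up an HTML attribute rather than running a CSS selector, so a selector string passed to it yields None, and select_one() is the call that takes a selector. A small sketch, using a placeholder URL and made-up markup (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

url = "https://example.com/article"  # placeholder URL, a plain str

# A str has no .get() method, which is what the traceback reports.
str_has_get = hasattr(url, "get")

# Made-up markup for illustration only.
soup = BeautifulSoup('<a href="//example.com/story" class="link">Story</a>', "html.parser")
link = soup.select_one("a")

href = link.get("href")              # Tag.get() looks up an attribute by name
not_a_selector = link.get("a.link")  # a CSS selector is not an attribute name, so this is None

print(str_has_get)
print(href)
print(not_a_selector)
```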

On another note: is there a reason why you use Selenium for web scraping? You could just make an HTTP GET request, e.g. with the requests library, to download the article, which would be significantly faster.

It would then look similar to this.

import sys

import requests
from bs4 import BeautifulSoup

urls = ["https://www.google.com"]  # requests needs a full URL including the scheme

def get_article():
    for url in urls:
        # get page content by HTTP request
        page_response = requests.get(url)
        # check if request was successful
        if page_response.status_code != 200:
            sys.exit(1)

        # parse the website's HTML
        parsed_html = BeautifulSoup(page_response.text, 'html.parser')

        # run the CSS selector on the parsed HTML; select_one() returns None if nothing matches
        # NOTE: this selector will not match anything on the google page
        text1 = parsed_html.select_one('div.Paragraph__component > span')
        print(text1)
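
If the goal is the whole article rather than the first paragraph, one possible approach is to collect every matching element and join the text. The sketch below assumes each paragraph lives in a div.Paragraph__component > span element, as in the question. The helper name extract_article_text and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup


def extract_article_text(html: str, selector: str = "div.Paragraph__component > span") -> str:
    """Join the text of every element matching `selector`, one paragraph per line."""
    parsed_html = BeautifulSoup(html, "html.parser")
    # select() returns all matches; get_text(strip=True) trims surrounding whitespace.
    paragraphs = [tag.get_text(strip=True) for tag in parsed_html.select(selector)]
    # Drop empty paragraphs so blank spans do not produce blank lines.
    return "\n".join(p for p in paragraphs if p)


# Made-up markup standing in for a downloaded article page.
sample = (
    '<div class="Paragraph__component"><span>First paragraph.</span></div>'
    '<div class="Paragraph__component"><span>  </span></div>'
    '<div class="Paragraph__component"><span> Second paragraph. </span></div>'
)

article_text = extract_article_text(sample)
print(article_text)
```

With requests you would pass page_response.text to the helper, and with Selenium driver.page_source.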