ERROR: 'NoneType' object has no attribute 'find_all'

I'm scraping a web page called CVE Trends:

import bs4, requests,webbrowser

LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"

response = requests.get(LINK)
response.raise_for_status()
soup=bs4.BeautifulSoup(response.text,'html.parser')
div_tweets=soup.find('div',class_='tweet_text')

a_tweets=div_tweets.find_all('a')
    
link_tweets =[]
for a_tweet in a_tweets:
    link_tweet= str(a_tweet.get('href'))
    if PRE_LINK in link_tweet:
        link_tweets.append(link_tweet)

from pprint import pprint
pprint(link_tweets)

This is the code I've written so far. I've tried many approaches, but it always gives the same error:

'NoneType' object has no attribute 'find_all'

Can someone help me please? I really need this. Thanks in advance for any answer.

CodePudding user response：

This happens because soup.find("div", class_="tweet_text") finds no matching element and returns None, so the following div_tweets.find_all('a') call fails. The site you're trying to scrape populates its content with JavaScript, so a plain GET request returns only this static shell:

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <title>
   CVE Trends - crowdsourced CVE intel
  </title>
  <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
  <meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
  <meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
   <meta content="Simon Bell" name="author"/>
   <meta content="website" property="og:type">
    <meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
     <meta content="https://cvetrends.com" property="og:url">
      <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
      <meta content="en_GB" property="og:locale"/>
      <meta content="en_US" property="og:locale:alternative"/>
      <meta content="CVE Trends" property="og:site_name"/>
      <meta content="summary_large_image" name="twitter:card"/>
      <meta content="@SimonByte" name="twitter:creator"/>
      <meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
      <meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
      <meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
      <link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
      <link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
      <link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
     </meta>
    </meta>
   </meta>
  </meta>
 </head>
 <body>
  <div id="root">
  </div>
  <noscript>
   Please enable JavaScript to run this app.
  </noscript>
  <script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
  </script>
 </body>
</html>

You can verify this using print(soup.prettify()).
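Independently of how the page is rendered, it helps to guard against find() returning None before calling find_all() on the result. The following is a minimal sketch, not a fix for the JavaScript problem itself: the function name extract_nvd_links and both HTML snippets are made up for illustration, and the RENDERED markup is only a guess at what the page might look like after JavaScript has run.

```python
import bs4

PRE_LINK = "https://nvd.nist.gov/"


def extract_nvd_links(html):
    """Return NVD links inside the first div.tweet_text, or [] if it is absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="tweet_text")
    if container is None:
        # find() returns None when nothing matches, so stop here instead of
        # calling find_all() on None and raising AttributeError.
        return []
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [
        a["href"]
        for a in container.find_all("a", href=True)
        if PRE_LINK in a["href"]
    ]


# Hypothetical inputs: an empty shell like the one above, and invented
# markup standing in for the page after JavaScript has filled it in.
EMPTY_SHELL = '<html><body><div id="root"></div></body></html>'
RENDERED = (
    '<div class="tweet_text">'
    '<a href="https://nvd.nist.gov/vuln/detail/CVE-0000-0000">NVD</a>'
    '<a href="https://example.com/">other</a>'
    "</div>"
)

print(extract_nvd_links(EMPTY_SHELL))
print(extract_nvd_links(RENDERED))
```

With the empty shell the function returns an empty list rather than crashing, which makes it obvious that the data simply is not in the response.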

To scrape this site you'll probably need a tool that runs JavaScript, such as Selenium.

CodePudding user response：

This happens because the response does not contain the content you expect.

https://cvetrends.com/

This website loads its content with JavaScript, so the data is not in the HTML returned by the request.

Instead of scraping the page, you can get the data from https://cvetrends.com/api/cves/24hrs

Here is one solution:

import requests
from urlextract import URLExtract

LINK = "https://cvetrends.com/api/cves/24hrs"
PRE_LINK = "https://nvd.nist.gov/"
link_tweets = []
# library for URL extraction
extractor = URLExtract()
# fetch the JSON response from LINK and fail early on HTTP errors
response = requests.get(LINK)
response.raise_for_status()
# parse the JSON body into Python objects
twitt_json = response.json()
twitt_datas = twitt_json.get('data') or []
for twitt_data in twitt_datas:
    # extract tweets
    twitts = twitt_data.get('tweets') or []
    for twitt in twitts:
        # extract the tweet text and check that it mentions an NVD link
        twitt_text = twitt.get('tweet_text') or ''
        if PRE_LINK in twitt_text:
            # find URLs in the text
            urls_list = extractor.find_urls(twitt_text)
            for url in urls_list:
                if PRE_LINK in url:
                    # store the URL itself, not the whole tweet text
                    link_tweets.append(url)
print(link_tweets)
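If installing urlextract is not an option, the same filtering can be sketched with the standard library only. This is an illustration under assumptions: the payload shape (data, then tweets, then tweet_text) is taken from the answer above, the SAMPLE values and the function name nvd_links_from_payload are invented, and the regular expression is a rough approximation rather than a full URL grammar.

```python
import json
import re

PRE_LINK = "https://nvd.nist.gov/"
# Rough URL pattern: "http(s)://" followed by everything up to whitespace.
URL_RE = re.compile(r"https?://\S+")


def nvd_links_from_payload(payload):
    """Collect unique NVD URLs from tweet texts, preserving first-seen order."""
    links = []
    for cve in payload.get("data") or []:
        for tweet in cve.get("tweets") or []:
            text = tweet.get("tweet_text") or ""
            for url in URL_RE.findall(text):
                # Drop punctuation that commonly trails a URL in prose.
                url = url.rstrip(".,;:!?)\"'")
                if url.startswith(PRE_LINK) and url not in links:
                    links.append(url)
    return links


# Hypothetical payload standing in for the API response.
SAMPLE = json.loads("""
{"data": [
  {"tweets": [
    {"tweet_text": "Details: https://nvd.nist.gov/vuln/detail/CVE-0000-0000."},
    {"tweet_text": "No link here"},
    {"tweet_text": "Again https://nvd.nist.gov/vuln/detail/CVE-0000-0000 and https://example.com/x"}
  ]},
  {"tweets": null}
]}
""")

print(nvd_links_from_payload(SAMPLE))
```

To run it against the live endpoint, pass requests.get(LINK).json() instead of SAMPLE, assuming the response really has the structure described.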