Webscraping BeautifulSoup


I have a school project where I have to scrape about 88,000 company websites in Belgium. Some websites take a bit longer to find a word, which I assume just means the site is bigger. However, when I get to DHL's website (www.dhl.com), my program does nothing at all. Is there a reason this is not possible, or can a company block scraping of their website? I don't think there is anything wrong with my code, but I've placed it below. The variable websites is just a list with all the companies' URLs.

import requests as req
from bs4 import BeautifulSoup as bs

counter = 0
for url in websites:
  counter += 1
  word = 'de'

  print(f'{counter}: {url}')

  try:
    r = req.get('http://' + url.strip())
    r.encoding = 'utf-8'

    html_content = r.text

    soup = bs(html_content, 'lxml')
    to_search = soup.find_all(text=lambda text: text and word.lower() in text)

    if len(to_search) > 0:
      print(f'\tamount of "{word}" on website: {len(to_search)}')
    else:
      print(f'\t"{word}" never occurred')

  except Exception:
    print(f'\t{url} is unavailable')

CodePudding user response:

Yes, companies can prevent scraping by detecting a bot. If you go to the DHL site, though, it seems to redirect, so I'm not sure if that's the issue. When I run your script, it just stalls.

This does raise another question, though. When you search for the word "de", are you looking for ONLY 'de'? As you have it coded, it will return any string that merely contains 'de'. For example, with this code on the dhl.com site, it finds 'bazadebezolkohpepadr="1231685289"' as one of the 26 'de's it counts.
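If you only want whole-word matches, a minimal sketch (the HTML snippet here is made up for illustration) is to pass a compiled regex with \b word boundaries to find_all instead of a substring lambda:

```python
import re
from bs4 import BeautifulSoup

# Toy HTML: the second string contains 'de' only as a substring
html = '<p>de deal</p><p>bazadebezolkohpepadr="1231685289"</p>'
soup = BeautifulSoup(html, 'html.parser')

# \b anchors the match at word boundaries, so 'de' buried inside
# longer tokens like 'bazade...' is not counted
pattern = re.compile(r'\bde\b', re.IGNORECASE)
matches = soup.find_all(string=pattern)
print(len(matches))  # 1 -- only the first <p> matches
```

When find_all is given a compiled regex, BeautifulSoup runs its search method against each text node, so only nodes containing 'de' as a standalone word come back.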

And to that point, it will also include any functions or scripts in the HTML. For example, this is also counted as a 'de' match for dhl.com:

!function(a){var e="https://s.go-mpulse.net/boomerang/",t="addEventListener";if("True"=="True")a.BOOMR_config=a.BOOMR_config||{},a.BOOMR_config.PageParams=a.BOOMR_config.PageParams||{},a.BOOMR_config.PageParams.pci=!0,e="https://s2.go-mpulse.net/boomerang/"; ... [long minified Boomerang/Akamai analytics script, truncated] ... }(window);
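One way to keep script and style content out of the count, sketched on a made-up snippet, is to filter out text nodes whose parent is a script or style tag:

```python
from bs4 import BeautifulSoup

# Toy HTML: 'de' appears in both a script body and visible text
html = '<script>var de = "de";</script><p>de visible text</p>'
soup = BeautifulSoup(html, 'html.parser')

# Find every text node containing 'de', then drop any that live
# inside <script> or <style> tags
hits = [s for s in soup.find_all(string=lambda t: t and 'de' in t)
        if s.parent.name not in ('script', 'style')]
print(len(hits))  # 1 -- only the <p> text survives the filter
```

Each result from find_all is a NavigableString, so checking its parent tag name is enough to tell rendered text apart from embedded code.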

A few things I'd change here:

  1. change to 'https://'
  2. include a User-Agent in the headers
  3. include the timeout parameter in the request (this doesn't fix the issue, but your code won't hang)

With that said, the code below reports amount of "de" on website: 21 for DHL:

import requests as req
from bs4 import BeautifulSoup as bs

websites = ['www.dhl.com']
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}


counter = 0
for url in websites:
  counter += 1
  word = 'de'

  print(f'{counter}: {url}')

  try:
    r = req.get('https://' + url.strip(), headers=headers, timeout=10)
    r.encoding = 'utf-8'

    html_content = r.text

    soup = bs(html_content, 'lxml')
    to_search = soup.find_all(text=lambda text: text and word.lower() in text)

    if len(to_search) > 0:
      print(f'\tamount of "{word}" on website: {len(to_search)}')
    else:
      print(f'\t"{word}" never occurred')

  except Exception:
    print(f'\t{url} is unavailable')

Output:

amount of "de" on website: 21