I'm trying to retrieve all URLs from a page using Python's Requests library. I can't figure out why my filterer is returning hundreds of items more than I am expecting. Code:
import requests
import re
r = requests.get('http://exrx.net/Lists/ExList/NeckWt', headers=headers_dict, timeout=3)
counter = 0
raw_html = r.text
listly = re.split('\"', raw_html)
for i in listly:
    if "https://exrx.net" in i or "../../" in i:
        pass
    else:
        listly.remove(i)
        counter += 1
print(listly)
print('-'*5)
print('the list is now', len(listly), 'objects long')
print(counter, ' objects were removed')
print('-'*5)
The final list, however, contains 487 items (down from >900), including the following, which confusingly do not match the condition in my if / else block. I cannot figure out why they are not being deleted:
['en', 'Content-Type', 'text/html; charset=utf-8', '... func = ', '... func.apply: ', "----- F'D: ", '... file = ', "----- ERR'D: ", "----- F'D: ", '', 'load', '_', ' blocked = TIME DELAY!', ' blocked = ', ' blocked = ', 'markLoaded dummyfile: ', '1', "let's go", 'on', 'on', 'on', 'on', 'script', 'text/javascript', 'head', '/detroitchicago/grapefruit.gif', 'prerender', '?orig=', '&v=', '/porpoiseant/army.gif', 'compid', '0', '', 'impression', '', 'impression', 'prerender', '?orig=', '&sts=', 'domain_id', '&visit_uuid=', 'undefined', 'false', 'false', 'function', 'CustomEvent', 'false', 'false', 'content-type', 'text/html; charset=UTF-8', 'generator', 'concrete5', 'shortcut icon', 'https://exrx.net/application/files/8014/4923/2704/Runner3.jpg', 'image/x-icon', 'icon', 'https://exrx.net/application/files/8014/4923/2704/Runner3.jpg', 'image/x-icon', 'canonical', 'https://exrx.net/Lists/ExList/NeckWt', 'text/javascript', '/index.php', '/updates/concrete5-8.5.7/concrete/images', '/index.php/tools/required', 'https://exrx.net', '', 'en_US', 'text/css', 'Logo', '79715', '3471', 'text/javascript', 'https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/ie/html5-shiv.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/ie/respond.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '', 'touchstart', 'https://fonts.googleapis.com/css?family=Source Sans Pro:300,400,700,900', 'stylesheet', 'text/css', '/application/files/cache/css/fruitful/iGotStyle.css?ts=1644387679', 'stylesheet', 'text/css', 'all', 'viewport', 'width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no', '/application/files/cache/css/fruitful/accessory.css?ts=1644387679', 'stylesheet', 'text/css', 'all', 'https://use.fontawesome.com/bf47fdcc0a.js', '', 'text/css', '', 
'//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'ca-pub-6329449765532083', 'text/javascript', '1', 'https://exrx.net/Lists/ExList/NeckWt', 'false', 'false', 'text/javascript', 'false', 'ad_cache_level', 'ad_lazyload_version', 'ad_load_version', 'city', 'Sydney', 'country', 'AU', 'days_since_last_visit', 'domain_id', 'domain_test_group', 'engaged_time_visit', 'ezcache_level', 'ezcache_skip_code', 'form_factor_id', 'framework_id', 'is_return_visitor', 'is_sitespeed', 'last_page_load', '', 'last_pageview_id', '', 'lt_cache_level', 'metro_code', 'page_ad_positions', '', 'page_view_count', 'page_view_id', '578b3a09-c637-461b-4c42-c6c83546001c', 'position_selection_id', 'postal_code', '2000', 'pv_event_count', 'response_size_orig', 'response_time_orig', 'serverid', '54.66.141.238:27055', 'state', 'NSW', 't_epoch', 'template_id', 'time_on_site_visit', 'url', 'https://exrx.net/Lists/ExList/NeckWt', 'user_id', 'weather_precipitation', 'weather_summary', '', 'weather_temperature', 'word_count', 'worst_bad_word_level', '&ez_orig=1', 'expires=', 'ezux_lpl_107151=', '|', '|', '; ', 'complete', 'onload', 'attach_ezolpl', 'attach_ezolpl', '578b3a09-c637-461b-4c42-c6c83546001c', 'false', 'page527', 'ccm-page ccm-page-id-527 page-type-page page-template-directory-template', 'siteHeader', 'container', 'row', 'logo', 'col-xs-6 col-md-3', 'ccm-custom-style-container ccm-custom-style-logo-79715', 'https://exrx.net/', '/application/files/3114/3635/4565/logo_same_proportion_5_2_2015.gif', 'ExRx.net: Exercise Prescription on Internet', 'ccm-image-block img-responsive bID-79715', 'mainNav', 'clearfix hidden-xs hidden-sm col-sm-9', 'nav', '', 'https://exrx.net/Lists/Directory', '_self', '', '', '/Lists/Directory', '_self', '', '', '/WeightTraining/Instructions', '_self', '', '', '/Lists/Muscle', '_self', '', '', '/Lists/Articulations', '_self', '', '', '/Calculators', '_self', '', '', 'https://exrx.net/Beginning', '_self', '', '', '/Beginning', '_self', '', '', 
'/WeightTraining', '_self', '', '', '/Kinesiology', '_self', '', '', '/Aerobic', '_self', '', '', '/ExInfo', '_self', '', '', '/Sports', '_self', '', '', '/Bodybuilding', '_self', '', '', '/Drugs', '_self', '', '', '/Psychology', '_self', '', '', '/FatLoss', '_self', '', '', '/Nutrition', '_self', '', '', '/Testing', '_self', '', '', 'https://exrx.net/Notes/SiteJournal', '_self', '', '', '/Notes/SiteJournal', '_self', '', '', '/People/Contact', '_self', '', '', '/Notes/Feedback', '_self', '', '', '/Notes/Archive/Feedback10', '_self', '', '', '/Questions', '_self', '', '', '/forum/', '_blank', '', '', '/Links', '_self', '', '', '/Abstracts', '_self', '', '', '/Journals', '_self', '', '', '/Videos', '_self', '', '', '/Talks', '_self', '', '', '/Notes/Donations', '_self', '', '', 'https://exrx.net/Store', '_self', '', 'mobileAssets', 'col-xs-6 visible-xs-block visible-sm-block text-right', 'icoMobileNav', 'fa fa-bars', 'text/javascript', '/packages/fruitful/themes/fruitful/js/initExRx.js', 'headerShell', 'container', 'row', 'col-sm-12', 'fruitful-page-title fruitfull-title-padding', 'page-title', 'row Breadcrumb-Container Add-Margin-Top', 'container', 'col-sm-9', 'http://exrx.net', '../Directory', 'col-sm-3', 'google_translate_element', 'text/javascript', 'mainShell', 'container ', 'row', 'col-sm-12', 'ccm-custom-style-container ccm-custom-style-directorytopadvertise-86906 Add-Margin-Bottom', 'ezoic-pub-ad-placeholder-103', '', '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'adsbygoogle', 'display:block; height:90px;', 'ca-pub-6329449765532083', '4409668012', 'container', 'row', 'col-sm-12', 'Sternocleidomastoid', '../../Muscles/Sternocleidomastoid', 'container', 'row', 'col-sm-12', 'row', 'col-sm-6', '../../WeightExercises/Sternocleidomastoid/CBNeckFlx', '../../WeightExercises/Sternocleidomastoid/CBNeckFlxBelt', '../../WeightExercises/Sternocleidomastoid/CBNeckRotationBelt', '../../WeightExercises/Sternocleidomastoid/CBNeckLtrFlxBelt', '_top', 
'../../WeightExercises/Sternocleidomastoid/LVNeckFlexionH', '_top', '../../WeightExercises/Sternocleidomastoid/LVLateralNeckFlexionH', '_top', '../../WeightExercises/Sternocleidomastoid/LVNeckFlx', '_top', '../../WeightExercises/Sternocleidomastoid/LVNeckLtrFlx', '_top', '../../WeightExercises/Sternocleidomastoid/WtLyingNeckFlexion', '../../WeightExercises/Sternocleidomastoid/WtNeckFlx', '_top', '../../WeightExercises/Sternocleidomastoid/WtNeckLateralFlex', '_top', 'col-sm-6', '../../WeightExercises/Sternocleidomastoid/BWFrontNeckBridge', '../../WeightExercises/Sternocleidomastoid/BWWallFrontNeckBridge', '../../WeightExercises/Sternocleidomastoid/BWWallSideNeckBridge', '../../Stretches/Sternocleidomastoid/NeckRetraction', '../../Stretches/Sternocleidomastoid/NeckRotation', 'https://exrx.net/WeightExercises/Sternocleidomastoid/STNeckFlexion', 'https://exrx.net/WeightExercises/Sternocleidomastoid/STNeckLateralFlexion', '', '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'adsbygoogle', 'display:inline-block;width:300px;height:250px', 'ca-pub-6329449765532083', '2861896011', 'container', 'row', 'col-sm-12', 'Splenius', '../../Muscles/Splenius', 'container', 'row', 'col-sm-12', 'row', 'col-sm-6', '../../WeightExercises/Splenius/CBNeckExt', '_top', '../../WeightExercises/Splenius/CBNeckExtBelt', '../../WeightExercises/Splenius/LVNeckExtentionH', '../../WeightExercises/Splenius/LVNeckExt', '_top', '../../WeightExercises/Splenius/WtLyingNeckExtension', '../../WeightExercises/Splenius/WtNeckExtension', '../../WeightExercises/Splenius/WtNeckExt', '_top', '../../WeightExercises/Splenius/WtNeckHarnessExt', '#Sternocleidomastoid', 'col-sm-6', 'https://exrx.net/WeightExercises/Splenius/BRNeckRetraction', '../../WeightExercises/Splenius/BWRearNeckBridge', '../../WeightExercises/Splenius/BWWallRearNeckBridge', '../../WeightExercises/Splenius/LyingIsometricNeckRetr', '../../Stretches/Splenius/Neck', 'https://exrx.net/WeightExercises/Splenius/STNeckExtension', 
'../../Stretches/ErectorSpinae/Plow', 'WaistWt#Erector', 'container', 'row', 'col-sm-12', 'BackWt', 'BackWt#UpperTrap', 'WaistWt', 'WaistWt#Erector', 'container', 'row', 'col-sm-12 Add-Margin-Top', 'container', 'subfooter no-print', 'text-align: center;', 'text-align: center;', '../../Lists/Directory', '../../Notes/Notes', '_parent', 'site-footer', 'container ', 'row', 'copyright', 'col-xs-12 col-sm-3', 'col-xs-12 col-sm-9', 'margin:0px !important', 'https://exrx.net/People/Contact', 'https://exrx.net/Notes/Privacy', 'https://exrx.net/Notes/Legal', 'https://exrx.net/Notes/ADA', 'https://www.facebook.com/pages/ExRxnet/1685475628344232', 'https://exrx.net/Notes/Feedback', 'ajax', 'https://exrx.net/Notes/Archive/Feedback1', 'https://exrx.net/Store', 'amzn-assoc-ad-d457ebf0-12d4-46d4-a3f1-6d2aa75f0d88', '', '//z-na.amazon-adsystem.com/widgets/onejs?MarketPlace=US&adInstanceId=d457ebf0-12d4-46d4-a3f1-6d2aa75f0d88', '/packages/fruitful/themes/fruitful/js/functions.js', 'text/javascript', '', '/packages/fruitful/themes/fruitful/js/bootstrap.min.js', 'text/javascript', '', 'text/javascript', '', '#mainNav', 'body', 'id', 'mobileNav', 'visible-xs-block visible-sm-block', 'hidden-xs hidden-sm', '#icoMobileNav', '.ccm-page, #mobileNav', 'slideOver', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/picturefill.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'exrx_net', 'audins.js', '__ez.script.add', '//go.ezoic.net/detroitchicago/audins.js?cb=195-3', 'display:none;', '//pixel.quantserve.com/pixel/p-31iz6hfFutd16.gif?labels=Domain.exrx_net,DomainId.107151', '0', '1', '1', 'Quantcast', 'text/javascript', 'false']
CodePudding user response:
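Generally, you should not remove elements from a list while iterating through it, which is what your for loop does. Python does not raise an error, but each removal shifts the remaining elements one position to the left, so the loop skips the element that follows every removed one and some unwanted items survive. Instead, try adding the desired elements to another list, or use a list comprehension.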
Example of list comprehension:
listly = [s for s in listly if "https://exrx.net" in s or "../../" in s]
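To see the skipping behaviour in isolation, here is a minimal sketch. The six strings are made-up sample data standing in for the split HTML, and `keep` is a hypothetical helper that holds the same condition as in the question:

```python
# Hypothetical sample data standing in for the pieces of the split HTML.
items = ['en', 'load', 'https://exrx.net/a', 'script', 'head', '../../b']

def keep(s):
    """Same test as in the question: keep absolute exrx.net or ../../ links."""
    return "https://exrx.net" in s or "../../" in s

# Buggy approach: removing from the list that is being iterated.
# After each removal the next element slides into the current index,
# so the loop never visits it.
buggy = list(items)
for s in buggy:
    if not keep(s):
        buggy.remove(s)

# Safe approach: build a new list and leave the original untouched.
fixed = [s for s in items if keep(s)]

print(buggy)  # 'load' and 'head' survive because they were skipped
print(fixed)
```

In the buggy loop, 'load' directly follows the removed 'en' and 'head' directly follows the removed 'script', so neither is ever tested. This is the same effect that leaves several hundred unwanted strings in the real list.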
CodePudding user response:
Take a look at BeautifulSoup, a widely used Python library for parsing HTML. The best way imo to get all the links on the page is by doing something like:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
req = Request("http://exrx.net/Lists/ExList/NeckWt")
page_source = urlopen(req)
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(page_source, "lxml")
links = []
for link in soup.find_all('a'):
    # get('href') returns None for <a> tags that have no href attribute
    links.append(link.get('href'))
This gets all the links on the page without you having to parse the HTML manually.
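The href values collected this way are a mix of absolute URLs, relative paths such as ../../Muscles/Splenius, fragment links, and possibly None. If full URLs are wanted, one option is to resolve each value against the page URL with the standard library's urljoin. This is only a sketch: `absolutize` is a hypothetical helper name, and the sample hrefs are taken from the output shown in the question rather than from a live request:

```python
from urllib.parse import urljoin

def absolutize(hrefs, base_url):
    """Resolve each href against base_url, dropping None and empty values."""
    return [urljoin(base_url, h) for h in hrefs if h]

# Page URL from the question; used as the base for relative links.
base = "https://exrx.net/Lists/ExList/NeckWt"

# Sample hrefs of each kind seen in the question's output,
# plus a None for an <a> tag without an href.
sample_hrefs = [
    "../../Muscles/Splenius",
    "/Lists/Directory",
    "https://exrx.net/Store",
    "BackWt#UpperTrap",
    None,
]

resolved = absolutize(sample_hrefs, base)
for url in resolved:
    print(url)
```

With a real page, the `links` list built in the loop above could be passed in place of `sample_hrefs`.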