Python scraping links from a webpage - Why no URLS?

Time:12-09

I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python (Python 3). When I try this I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "view page source" (forgive my naiveté here, I'm definitely still a novice at this!). What I'd really like is to scrape the URLs from the content shown in the "Elements" panel of the DevTools page. I can sift through the HTML on that page manually and find the links, so I know they're in there... I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Need this context below to deal with HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
urllist = []
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    urllist.append(link.get('href'))
print(urllist)
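(As an aside to the code above: `link.get('href')` returns `None` for anchors without an `href` attribute, and relative paths need joining against the base URL. A small sketch of that cleanup, with a helper name of my own choosing:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Collect absolute URLs from every <a href> in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:  # skip anchors that have no href attribute
            urls.append(urljoin(base_url, href))  # resolve relative paths
    return urls


sample = '<a href="/p/item-1">One</a><a>no href</a><a href="https://example.com/x">X</a>'
print(extract_links(sample, 'https://www.target.com'))
```

This won't fix the empty list by itself, but it avoids `None` entries once the right HTML is being parsed.)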

If it helps, I found code that someone developed in JavaScript that can be run from the developer console, and it grabbed all of my links. But my goal is to be able to do this in Python (Python 3).

var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
};
function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    };
    table += '</tbody></table>';
    var w = window.open("");
    w.document.write(table);
}
make_table();
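For reference, here is a rough Python analogue of that console snippet, assuming the fully rendered HTML is already available as a string (the variable names are my own; getting that rendered HTML into Python is the separate problem discussed in the answers below):

```python
import re

from bs4 import BeautifulSoup

html = '''
<a href="https://example.com/p/1">  First   Product </a>
<a href="https://example.com/p/2">Second Product</a>
'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for a in soup.find_all('a', href=True):
    # Mirror the JS: collapse runs of whitespace in the link text, keep the href
    cleantext = re.sub(r'\s+', ' ', a.get_text()).strip()
    rows.append([cleantext, a['href']])
print(rows)
```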

CodePudding user response:

I suspect this is occurring because Target's website (at least the main page) builds its content via JavaScript. Your browser executes that script and renders the resulting DOM (what you see under "Elements" in DevTools), but your Python code does no such thing: urllib only fetches the raw page source. See this post for help in that regard.

CodePudding user response:

Without going into the specifics of your code: fundamentally, if you can make a call to a URL, you already have that URL. If you use the script to scrape one entered URL at a time, that URL could be logged by appending the right entry to urllist (the object returned by each link.get('href')).
If you have some other original source (a list?) for the URLs to scrape, those could be added to urllist in a similar fashion.

The course of action chosen depends on the actual data structure returned by link.get('href'). Suggestions:

  • If it's a string containing HTML, put that string under a dict key 'html' and add another dict key 'url'.
  • If it's already a dict object, just add a key-value pair 'url'.
  • If you want to enter one URL and extract the others from that URL's HTML document, retrieve the HTML and parse it with something like ElementTree.

You can do this a number of ways.
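For instance, the dict-per-link idea from the list above could look something like this (all names hypothetical, and using BeautifulSoup as in the question rather than ElementTree):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_link_records(html, base_url):
    """Build one dict per anchor, pairing the raw tag markup with its resolved URL."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for a in soup.find_all('a', href=True):
        records.append({
            'html': str(a),                       # the anchor tag itself
            'url': urljoin(base_url, a['href']),  # absolute link
        })
    return records


demo = '<a href="/p/42">Widget</a>'
print(collect_link_records(demo, 'https://www.target.com'))
```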
