Extract src of <img> element (selected by class) from multiple HTML pages

Time:05-01

Example:

A website has pages at URLs of the form https://images.com/Robots.aspx?ID=xxxx, where xxxx is an integer between 1 and 1935.

On each given page there can be an <img src="Images\Robots\{robotname}.png">. Not all pages have this element.

I need to extract all existing {robotname} values and then download the images, but I'm struggling to understand how I can store the element in an object (in Python or JS, for example).

How do I start, and what can I read to learn how to do it?

CodePudding user response:

In Python you can use BeautifulSoup: extract all img tags with soup.find_all("img") and work with the data from there.
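
A minimal sketch of that idea, assuming the third-party beautifulsoup4 package is installed. The class name "robot-image", the function name extract_img_srcs and the sample HTML are placeholders for illustration, since the question does not say which class the image carries:

```python
from bs4 import BeautifulSoup

# Hypothetical class name; the question does not state the real one.
IMG_CLASS = "robot-image"


def extract_img_srcs(html, css_class=IMG_CLASS):
    """Return the src of every <img> with the given class (empty list if none)."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        img["src"]
        for img in soup.find_all("img", class_=css_class)
        if img.has_attr("src")
    ]


# Made-up sample page: one matching image and one unrelated image.
sample = (
    r'<html><body><img class="robot-image" src="Images\Robots\R2D2.png">'
    r'<img class="logo" src="logo.png"></body></html>'
)
print(extract_img_srcs(sample))
```

Pages without a matching image simply give an empty list, so the same function can be applied to every ID.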

CodePudding user response:

  1. Download each page in a loop with AJAX.
  2. Parse the DOM with something like jsdom.
  3. Use a selector with [querySelectorAll()](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll) to get each image element.
  4. Use a regular expression on the image src attribute to get the robot name, like: $img.src.match(/([^\/]+)\.png$/i)[1].
  5. Download all the robot images with AJAX.
  6. Combine each robot name and its downloaded image into an object of key-value pairs.

Let me know if you need more help or a code example.
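
The steps above can be sketched roughly as follows. This is not the JavaScript approach itself: it swaps in the Python standard library (urllib and html.parser) for AJAX and jsdom, and a dict for the key-value object. The class name "robot-image" and all function names are assumptions made for illustration, and error handling is kept to a minimum:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

PAGE_URL = "https://images.com/Robots.aspx?ID={id}"  # URL pattern from the question
IMG_CLASS = "robot-image"  # hypothetical class name, not given in the question
# Robot name = last path segment before ".png"; src may use "\" or "/".
NAME_RE = re.compile(r"([^/\\]+)\.png$", re.IGNORECASE)


class ImgSrcParser(HTMLParser):
    """Collects the src of every <img> that carries the wanted class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "img" and self.css_class in classes and attrs.get("src"):
            self.srcs.append(attrs["src"])


def robot_srcs(html, css_class=IMG_CLASS):
    """Map robot name -> src for every matching image in one page."""
    parser = ImgSrcParser(css_class)
    parser.feed(html)
    found = {}
    for src in parser.srcs:
        match = NAME_RE.search(src)
        if match:
            found[match.group(1)] = src
    return found


def fetch(url):
    """Download a URL and return the raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def scrape(first_id=1, last_id=1935):
    """Visit every page and return {robot name: image bytes}."""
    robots = {}
    for page_id in range(first_id, last_id + 1):
        page_url = PAGE_URL.format(id=page_id)
        try:
            html = fetch(page_url).decode("utf-8", errors="replace")
            for name, src in robot_srcs(html).items():
                # Backslashes are not valid URL separators, so normalise them.
                image_url = urljoin(page_url, src.replace("\\", "/"))
                robots[name] = fetch(image_url)
        except OSError:
            continue  # skip pages or images that fail to download
    return robots
```

Calling scrape(1, 10) would try only the first ten IDs, which is a reasonable way to check the class name and regular expression before running the full range. Each value in the returned dict can then be written to disk as {robotname}.png.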
