Scraping Hidden Data using Python

Time:04-12

I have written the following code to scrape data:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://buyacp.com/parts/bumper-rear-primered-finish-fk-bb004/'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

url_list = []  # collects one dictionary per product page

Product_Data = {
    'SKU': soup.dd.text,
    'name': soup.h1.text,
    'description': soup.find('div', {"class": "tabs-contents"}),
}
url_list.append(Product_Data)

I have tried many variations of this code to scrape the description data, but I only ever get the content of the last tab (warranty info), which is not what I am after. The data for the Description and Fitment tabs seems to be hidden, and I can't figure out how to scrape it.

Can someone point me in the right direction?

Thanks!

CodePudding user response:

The reason you cannot access that data, as hinted at, is that it is not part of the HTML delivered on page load. JavaScript fills the tab elements afterwards, and requests only downloads the raw HTML without running any scripts. This is why, like Andrej said, the information is found in the script tags.

To resolve this, you can either pull the information out of those script tags (again, like Andrej said) or use a Python library that can execute the page's JavaScript and give you the rendered HTML. My library of choice is "Requests-HTML".

The code below will pull the information you want. I am not sure what you want to do with the tab information, so I have not filtered it:

# Will need to install 'requests-html'
from requests_html import HTMLSession

# Assign the URL,
# create the HTMLSession object,
# and run the "get" method to retrieve information from the URL
url = 'https://buyacp.com/parts/bumper-rear-primered-finish-fk-bb004/'
session = HTMLSession()
response = session.get(url)

# Check that the HTTP status code was 200
# (successfully retrieved info from URL)
res_code = response.status_code
if res_code == 200:
    # This is the critical line: render() executes the page's JavaScript
    # so the tab content ends up in the HTML.
    # Note: the first call downloads a Chromium build, which can take a while.
    response.html.render()

    # Get the item SKU and Name from the html
    # Note: the "html.find()" method takes CSS selectors
    item_sku = response.html.find("dd[itemprop='sku']", first=True).text
    item_name = response.html.find("h1[itemprop='name']", first=True).text

    # Get the tab content and put it into a dictionary
    tabs = {}
    for tab in response.html.find("div.tab-content"):
        tab_name = tab.find("h4", first=True).text
        tab_text = tab.find(".collapsibleMobile-content", first=True).text

        tabs[tab_name] = tab_text

    item_info = {
        'SKU': item_sku,
        'Name': item_name,
        "Tabs": tabs
    }

    print(item_info)
    
else:
    print("Could not reach web page!")
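
Since the code above returns every tab unfiltered, here is a minimal sketch of one way to keep only the tabs the question asks about. The sample dictionary and the `select_tabs` helper are hypothetical, and the tab names "Description" and "Fitment" are taken from the question, so the headings on the rendered page may be spelled differently:

```python
# Hypothetical stand-in for the `tabs` dictionary built by the code above;
# the real keys and text depend on what the page renders.
tabs = {
    "Description": "Replaces a rusted, damaged, or missing bumper.",
    "Fitment": "1970-1973 Ford Maverick",
    "Warranty Info": "Limited warranty.",
}

# Tab names taken from the question (assumed to match the page headings).
wanted = ("Description", "Fitment")


def select_tabs(tabs, wanted):
    """Return only the wanted tabs, comparing names case-insensitively."""
    wanted_lower = {name.lower() for name in wanted}
    return {
        name: text
        for name, text in tabs.items()
        if name.strip().lower() in wanted_lower
    }


selected = select_tabs(tabs, wanted)
print(selected)
```

Matching on lowercased, stripped names is just a guard against small differences in how the headings are written; an exact comparison works as well if the names are known.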

CodePudding user response:

The data you see on the page is stored inside the HTML, in a <script> tag, as a JSON string that has itself been JSON-encoded. To decode it, you can use the following example:

import re
import json
import requests

url = "https://buyacp.com/parts/bumper-rear-primered-finish-fk-bb004/"

data = re.search(
    r'window\.stencilBootstrap\("product", (".*")', requests.get(url).text
).group(1)
data = json.loads(json.loads(data))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for f in data["productCustomFields"]:
    print("{:<20} {}".format(f["name"], f["value"]))

Prints:

Oem#                 D0DZ-17906-A
Part Type            Bumper
Bullet 1             Replaces a rusted, damaged, or missing bumper to restore original appearance
Bullet 2             Constructed from steel; Primer finish allows for easy paint prep - color matched bumpers give the car a modern look
Bullet 3             Made to replicate the factory's original bumper fitment
Bullet 4             Direct fit for 70-73 cars
Bullet 5             Can be adapted to fit 74-77 using earlier model Maverick bumper brackets, lower valance, and quarter panel extensions as well as inner trunk brackets from a 69-70 Mustang (some welding required)
Brand Title 1        Quality Guarantee
Brand Copy 1         All of ACP’s products are precisely crafted with high grade materials and all-new advanced tooling. Each item is tested and quality checked for accurate appearance and functionality before it reaches you.
Brand Title 2        Expertly Crafted
Brand Copy 2         ACP’s exclusive control over manufacturing ensures that all of our products are designed and crafted with thorough precision and expertise.
Brand Title 3        Exact OE Specifications
Brand Copy 3         ACP strives to match or exceed OE Correct standards for each of our products to maintain the original factory look.
Partlocation         Rear
Material             Steel
Color Finish         Primer
Spec Size            Null
Prop 65              Yes
Plpymm 1             1970-1973 Ford Maverick, 1971-1973 Mercury Comet
Categories           Body
Part Group           Bumper
Make                 Mercury
Make                 Ford
Model                Comet
Model                Maverick
Year                 1973
Year                 1972
Year                 1971
Year                 1970
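
The double json.loads call above can look odd, so here is a small offline sketch of what it does, using the same regular expression on a made-up HTML snippet instead of the live page. The sample fields and the trailing `).load();` are assumptions for illustration only. The sketch also groups repeated field names (such as "Year") into lists, which is one possible way to keep all their values together:

```python
import json
import re
from collections import defaultdict

# Made-up sample data; the real page carries many more fields.
sample_fields = [
    {"name": "Material", "value": "Steel"},
    {"name": "Year", "value": "1973"},
    {"name": "Year", "value": "1972"},
]

# Encode twice to mimic a JSON document stored as a JSON string literal.
embedded = json.dumps(json.dumps({"productCustomFields": sample_fields}))
html = '<script>window.stencilBootstrap("product", ' + embedded + ").load();</script>"

match = re.search(r'window\.stencilBootstrap\("product", (".*")', html)
if match is None:
    raise ValueError("bootstrap call not found; the page layout may have changed")

inner = json.loads(match.group(1))  # first pass: string literal -> JSON text
data = json.loads(inner)            # second pass: JSON text -> dictionary

# Collect every value under its field name so repeated names are not lost.
grouped = defaultdict(list)
for field in data["productCustomFields"]:
    grouped[field["name"]].append(field["value"])

print(dict(grouped))
```

Checking the match for None before calling group() gives a clearer error than an AttributeError if the site ever changes how it embeds the data.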