How to capture data from a website as key-value pairs using Python?

Time:12-25

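import requests
from bs4 import BeautifulSoup

# Assumption: `headers` was not defined in the original snippet;
# a basic User-Agent is used here.
headers = {'User-Agent': 'Mozilla/5.0'}

test_link = 'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt'
r = requests.get(test_link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
whole_data = soup.find('div', class_='fieldset-wrapper')
specifications = []
specifications_value = []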
for variable1 in whole_data.find_all('div', class_='field__label'):
    #print(variable1.text)
    variable1 = variable1.text
    specifications = list(variable1.split('\n'))
    #print(specifications)
for variable2 in whole_data.find_all('div', class_='field__item'):
    #print(variable2.text)
    variable2 = variable2.text
    specifications_value = list(variable2.split('\n'))
    #print(specifications_value)

Issue: I am getting the data, but in separate variables and separate for loops. How can I map these two variables as key-value pairs, so that I can check conditions such as: if the key is Platform, take only its value (Boxed Processor)?

I want to capture the data in such a way that if the key is 'Platform', only its value (Boxed Processor) is captured, and similarly for all the other 14 tags.

CodePudding user response:

You can iterate over a list of expected keys and use the :-soup-contains pseudo-class to target the label node whose text contains each key, then select its associated value node(s). If a match is found, store its text under that key; otherwise, store ''.

import requests
from bs4 import BeautifulSoup as bs

links = ['https://www.amd.com/en/products/cpu/amd-ryzen-7-3800xt',
         'https://www.amd.com/en/products/cpu/amd-ryzen-9-3900xt']

all_keys = ['Platform', 'Product Family', 'Product Line', '# of CPU Cores',
            '# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
            'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
            'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']

with requests.Session() as s:

    s.headers = {'User-Agent': 'Mozilla/5.0'}

    for link in links:

        r = s.get(link)
        soup = bs(r.content, 'lxml')
        specification = {}

        for key in all_keys:

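            # "+" is the adjacent sibling combinator: it matches the value
            # node that immediately follows the matching label node.
            spec = soup.select_one(
                f'.field__label:-soup-contains("{key}") + .field__item, '
                f'.field__label:-soup-contains("{key}") + .field__items .field__item')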

            if spec is None:
                specification[key] = ''
            else:
                if key == '*OS Support':
                    specification[key] = [
                        i.text for i in spec.parent.select('.field__item')]
                else:
                    specification[key] = spec.text

        print(specification)
        print()
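
If you have already collected the labels and the values into two parallel lists, as the code in the question tries to do, another option is to pair them by position with zip. The following is only a minimal sketch: the sample lists are illustrative stand-ins for the scraped text, and pair_specs is a hypothetical helper name, not part of any library. It assumes both lists have the same length and order, which may not hold on the real page when one label has several values (for example *OS Support).

```python
# Illustrative sample data only; in practice these lists would hold the text
# of the .field__label and .field__item elements collected from the page.
labels = ['Platform', '# of CPU Cores', '# of Threads']
values = ['Boxed Processor', '12', '24']


def pair_specs(labels, values):
    """Pair each label with the value at the same position (hypothetical helper)."""
    if len(labels) != len(values):
        # Positional pairing is only meaningful when the lists are aligned.
        raise ValueError('labels and values are not aligned')
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


specification = pair_specs(labels, values)

# Look a value up by key; .get returns '' when the key is missing.
platform = specification.get('Platform', '')
print(platform)
```

Once the data is in a dictionary, a condition such as "only take the value when the key is Platform" becomes a plain key lookup, so no second loop is needed.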