Noob: Joined headers on webscrape, need to split into parent/child

Time:01-23

I'm compiling a list of phone brands, models, prices, etc. from a telco website. Through trial and error I'm off to an OK start (from a noob position), but the headers come out as a single joined string. It appears the HTML keeps the brand and model as individual elements rather than as parent/child or attributes per se?

Help/guidance would be appreciated...

Script:

import pandas
import requests
from bs4 import BeautifulSoup

url = 'https://www.vodafone.com.au/mobile-phones'

html = requests.get(url)

soup = BeautifulSoup(html.text, 'html.parser')

headers = soup.find_all('h2')
divs = soup.find_all('div')

model = list(map(lambda h: h.text.strip(), headers))
print(model)

Result: ['AppleiPhone 14 Pro Max', 'AppleiPhone 14 Pro', 'AppleiPhone 14 Plus', 'AppleiPhone 14', 'SamsungSamsung Galaxy Z Fold4 5G', 'SamsungSamsung Galaxy Z Flip4 5G', 'SamsungSamsung Galaxy S22 Ultra 5G', 'AppleiPhone 13', 'GoogleGoogle Pixel 7 Pro', 'GoogleGoogle Pixel 7', 'AppleiPhone 12', 'AppleiPhone 11', 'SamsungSamsung Galaxy S22 5G', 'SamsungSamsung Galaxy A13 5G', 'SamsungSamsung Galaxy A13 4G', 'GoogleGoogle Pixel 6 Pro', 'GoogleGoogle Pixel 6a', 'OPPOOPPO Find X5 Pro 5G', 'OPPOOPPO A57 4G', 'OPPOOPPO Reno8 5G', 'SamsungSamsung Galaxy A53 5G', 'SamsungSamsung Galaxy A33 5G', 'SamsungSamsung Galaxy Z Fold3 5G', 'SamsungSamsung Galaxy S21 5G', 'SamsungSamsung Galaxy S21 Ultra 5G', 'SamsungSamsung Galaxy S21 FE 5G', 'AppleiPhone SE (3rd gen)', 'TCLTCL 20 Pro 5G', 'MotorolaMotorola moto g62 5G', 'SamsungSamsung Galaxy A73 5G', 'OPPOOPPO Find X5 5G', 'OPPOOPPO Find X5 Lite 5G', 'MotorolaMotorola moto e22i 4G', 'MotorolaMotorola edge 30 pro 5G', 'MotorolaMotorola edge 30 5G', 'Why choose Vodafone?']

~Ideal result: Apple; iPhone 14 Pro Max, Apple; iPhone 14 Pro, Apple; iPhone 14 Plus, Apple; iPhone 14,

CodePudding user response:

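Each h2 header wraps the manufacturer and the device name in separate child divs, so calling .text on the header concatenates them. Select those divs inside each header instead, matching on the "__Manufacturer-" and "__Name-" fragments of their class names.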

import requests
from bs4 import BeautifulSoup

url = 'https://www.vodafone.com.au/mobile-phones'

html = requests.get(url)

soup = BeautifulSoup(html.text, 'html.parser')

headers = soup.find_all('h2')

brand_device = []

for header in headers:
    # Brand and model sit in separate child divs of each <h2>.
    manufacturer_div = header.select('div[class*="__Manufacturer-"]')
    device_div = header.select('div[class*="__Name-"]')
    # Skip headers that are not product cards, e.g. 'Why choose Vodafone?'.
    if manufacturer_div and device_div:
        brand_device.append(
            f'{manufacturer_div[0].text.strip()};{device_div[0].text.strip()}')

print(brand_device)
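
If the live page changes or blocks the request, the same idea can be checked offline. Below is a minimal sketch on a hand-written snippet. The markup and the full class names (Card__Manufacturer-ab12, Card__Name-cd34) are made up for illustration; only the "__Manufacturer-" and "__Name-" fragments come from the selectors above. It shows why .text joins the two parts and how selecting the child divs separates them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the full class names are invented for this sketch.
SAMPLE_HTML = (
    '<h2><div class="Card__Manufacturer-ab12">Apple</div>'
    '<div class="Card__Name-cd34">iPhone 14 Pro Max</div></h2>'
    '<h2><div class="Card__Manufacturer-ab12">Samsung</div>'
    '<div class="Card__Name-cd34">Samsung Galaxy S22 5G</div></h2>'
    '<h2>Why choose Vodafone?</h2>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
headers = soup.find_all('h2')

# .text concatenates every string inside the header, hence 'AppleiPhone ...'.
joined = [h.text.strip() for h in headers]

# Selecting each child div keeps brand and model apart.
pairs = []
for h in headers:
    manufacturer = h.select_one('div[class*="__Manufacturer-"]')
    device = h.select_one('div[class*="__Name-"]')
    # select_one returns None when nothing matches (non-product headers).
    if manufacturer is not None and device is not None:
        pairs.append((manufacturer.text.strip(), device.text.strip()))

print(joined)
print(pairs)
```

The last header has neither div, so it is dropped from pairs, which also removes the stray 'Why choose Vodafone?' entry.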

Feel free to format the result to suit your needs.
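
As one possible way to format it, here is a small sketch that turns 'brand;device' strings into a parent/child mapping of brand to models. The sample list is a hypothetical subset in the same format as brand_device, and group_by_brand is an illustrative helper name, not part of any library.

```python
from collections import defaultdict

# Hypothetical sample in the 'brand;device' format built above.
brand_device = [
    'Apple;iPhone 14 Pro Max',
    'Apple;iPhone 14',
    'Samsung;Samsung Galaxy S22 5G',
    'Google;Google Pixel 7',
]


def group_by_brand(pairs):
    """Map each brand (parent) to the list of its models (children)."""
    grouped = defaultdict(list)
    for pair in pairs:
        # Split on the first ';' only, in case a model name contains one.
        brand, model = pair.split(';', 1)
        grouped[brand].append(model)
    return dict(grouped)


phones_by_brand = group_by_brand(brand_device)
print(phones_by_brand)
```

A dict of lists keeps the order in which brands first appear and can be passed on to pandas or written out as JSON if needed.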
