How to get URL from two dropdown lists (webscraping with python)


I want to scrape this webpage (www.autocar.co.uk): for each car manufacturer in the drop-down menu, I want to select each model to get the href/reference to the model's page, and then retrieve some information from each model page (not reflected in the code yet).

As I just started coding, I would highly appreciate your input! Thanks in advance!! :)

Desired output:

https://www.autocar.co.uk/car-review/abarth/595
https://www.autocar.co.uk/car-review/abarth/595-competizione
https://www.autocar.co.uk/car-review/abarth/124-spider-2016-2019
https://www.autocar.co.uk/car-review/abarth/695-biposto-2015-2016
https://www.autocar.co.uk/car-review/ac-schnitzer/acs3-sport
https://www.autocar.co.uk/car-review/ac-schnitzer/acs1
https://www.autocar.co.uk/car-review/ac-schnitzer/acs5-sport
https://www.autocar.co.uk/car-review/allard/j2x-mkii
https://www.autocar.co.uk/car-review/alfa-romeo/giulia
https://www.autocar.co.uk/car-review/alfa-romeo/tonale

Current output (the stray "https://www.autocar.co.uk0" lines need to be removed):

https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/abarth/595
https://www.autocar.co.uk/car-review/abarth/595-competizione
https://www.autocar.co.uk/car-review/abarth/124-spider-2016-2019
https://www.autocar.co.uk/car-review/abarth/695-biposto-2015-2016
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/ac-schnitzer/acs3-sport
https://www.autocar.co.uk/car-review/ac-schnitzer/acs1
https://www.autocar.co.uk/car-review/ac-schnitzer/acs5-sport
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/allard/j2x-mkii
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/alfa-romeo/giulia
https://www.autocar.co.uk/car-review/alfa-romeo/tonale
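For reference, each stray line comes from a "0" key in the JSON `options` dict (the "All models" placeholder), which the f-string appends directly to the base URL. A minimal, self-contained sketch of the filter, using a hypothetical sample of what the endpoint returns:

```python
# Hypothetical sample of the ajax/car-models response's "options" dict;
# the "0" key is the "All models" placeholder that produces the bad URL.
options = {
    "0": "All models",
    "/car-review/abarth/595": "595",
    "/car-review/abarth/595-competizione": "595 Competizione",
}

# Skip the "0" key before building the full URLs.
urls = [f'https://www.autocar.co.uk{path}' for path in options if path != "0"]
print(urls)
```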

Current code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

#Inputs/URLs to scrape: 
url = "http://www.autocar.co.uk/"
s = requests.Session()

r = s.get(url)
soup = BeautifulSoup(r.text,'html.parser')
full_car_list = []

car_list = [(x.text, x.get("value"), f'https://www.autocar.co.uk/ajax/car-models/{x.get("value")}/0') for x in soup.select_one('#edit-make').select('option')]
for x in car_list:
    r = s.get(x[2])
    try:
        for item in r.json()['options'].items():
            #Car Model
            car_model_url = (f'https://www.autocar.co.uk{item[0]}')
            print(car_model_url)
            
    except Exception as e:
        full_car_list.append((x[0], 'no models', f'https://www.autocar.co.uk/vehicles/{x[0]}'))

CodePudding user response:

You'll want to refactor this into a couple of functions for clarity; that also makes it easier to skip data that isn't valid (occasionally the ajax/car-models API returns a list instead of a dict):

from bs4 import BeautifulSoup
import requests

sess = requests.Session()


def get_make_info():
    resp = sess.get("http://www.autocar.co.uk/")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    for option in soup.select('#edit-make option'):
        make_id = option['value']
        yield (make_id, option.text)


def get_make_models(make_id):
    info_url = f'https://www.autocar.co.uk/ajax/car-models/{make_id}/0'
    resp = sess.get(info_url)
    resp.raise_for_status()
    data = resp.json()
    options = data['options']
    if isinstance(options, list):  # Invalid format, skip
        return
    for model_url, model_name in options.items():
        if model_url == "0":  # "All models"
            continue
        model_url = f'https://www.autocar.co.uk{model_url}'
        yield (model_url, model_name)


for make_id, make_name in get_make_info():
    for model_url, model_name in get_make_models(make_id):
        print(make_id, make_name, model_url, model_name)
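If you want to persist the rows instead of printing them, the same nested loop can feed Python's `csv` module. A minimal sketch with stand-in data in place of the live generators, so it runs offline (the names and sample values are illustrative):

```python
import csv
import io

# Stand-ins for get_make_info() / get_make_models() so the sketch runs offline.
makes = [("1", "Abarth")]
models = {"1": [("https://www.autocar.co.uk/car-review/abarth/595", "595")]}

buf = io.StringIO()  # use open('makes_models.csv', 'w', newline='') for a real file
writer = csv.writer(buf)
writer.writerow(["Make", "Model", "Url"])
for make_id, make_name in makes:
    for model_url, model_name in models[make_id]:
        writer.writerow([make_name, model_name, model_url])

print(buf.getvalue())
```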

CodePudding user response:

Using the code from your previous question, all you have to do is print out the 'Url' column of the dataframe (the `cars_df.Model != 'All models'` filter drops the stray entries):

import requests
from bs4 import BeautifulSoup 
import pandas as pd

url = "http://www.autocar.co.uk/"
s = requests.Session()

r = s.get(url)
soup = BeautifulSoup(r.text,'html.parser')
full_car_list = []
car_list = [(x.text, x.get("value"), f'https://www.autocar.co.uk/ajax/car-models/{x.get("value")}/0') for x in soup.select_one('#edit-make').select('option')]
for x in car_list:
    r = s.get(x[2])
    try:
        for item in r.json()['options'].items():
            full_car_list.append((x[0], item[1], f'https://www.autocar.co.uk{item[0]}'))
    except Exception as e:
        full_car_list.append((x[0], 'no models', f'https://www.autocar.co.uk/vehicles/{x[0]}'))
cars_df = pd.DataFrame(full_car_list[1:], columns = ['Make', 'Model', 'Url'])
cars_df = cars_df[cars_df.Model != 'All models']
cars_df.to_csv('makes_models.csv')
for x in cars_df.Url.tolist():
    print(x)