I want to webscrape this webpage (www.autocar.co.uk). Therefore, I want to select each car manufacturer in a drop down menu and the model to get the HREF/reference to the model website and then retrieve some information from each model page (not reflected in the code yet)
As I just started coding I would higly appreciate your input! Thanks in advance!! :)
Desired output:
https://www.autocar.co.uk/car-review/abarth/595
https://www.autocar.co.uk/car-review/abarth/595-competizione
https://www.autocar.co.uk/car-review/abarth/124-spider-2016-2019
https://www.autocar.co.uk/car-review/abarth/695-biposto-2015-2016
https://www.autocar.co.uk/car-review/ac-schnitzer/acs3-sport
https://www.autocar.co.uk/car-review/ac-schnitzer/acs1
https://www.autocar.co.uk/car-review/ac-schnitzer/acs5-sport
https://www.autocar.co.uk/car-review/allard/j2x-mkii
https://www.autocar.co.uk/car-review/alfa-romeo/giulia
https://www.autocar.co.uk/car-review/alfa-romeo/tonale
Output as of now --> we need to remove the "https://www.autocar.co.uk0":
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/abarth/595
https://www.autocar.co.uk/car-review/abarth/595-competizione
https://www.autocar.co.uk/car-review/abarth/124-spider-2016-2019
https://www.autocar.co.uk/car-review/abarth/695-biposto-2015-2016
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/ac-schnitzer/acs3-sport
https://www.autocar.co.uk/car-review/ac-schnitzer/acs1
https://www.autocar.co.uk/car-review/ac-schnitzer/acs5-sport
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/allard/j2x-mkii
https://www.autocar.co.uk0
https://www.autocar.co.uk/car-review/alfa-romeo/giulia
https://www.autocar.co.uk/car-review/alfa-romeo/tonale
Code as of now:
from bs4 import BeautifulSoup
import requests
import pandas as pd
#Inputs/URLs to scrape:
url = "http://www.autocar.co.uk/"
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.text,'html.parser')
full_car_list = []
car_list = [(x.text, x.get("value"), f'https://www.autocar.co.uk/ajax/car-models/{x.get("value")}/0') for x in soup.select_one('#edit-make').select('option')]
for x in car_list:
r = s.get(x[2])
try:
for item in r.json()['options'].items():
#Car Model
car_model_url = (f'https://www.autocar.co.uk{item[0]}')
print(car_model_url)
except Exception as e:
full_car_list.append((x[0], 'no models', f'https://www.autocar.co.uk/vehicles/{x[0]}'))
CodePudding user response:
You'll want to refactor things into a couple of functions for clarity; that also makes it easier to skip data that isn't valid (apparently occasionally you'd get a list from the ajax/car-models
API):
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
def get_make_info():
resp = sess.get("http://www.autocar.co.uk/")
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')
for option in soup.select('#edit-make option'):
make_id = option['value']
yield (make_id, option.text)
def get_make_models(make_id):
info_url = f'https://www.autocar.co.uk/ajax/car-models/{make_id}/0'
resp = sess.get(info_url)
resp.raise_for_status()
data = resp.json()
options = data['options']
if isinstance(options, list): # Invalid format, skip
return
for model_url, model_name in options.items():
if model_url == "0": # "All models"
continue
model_url = f'https://www.autocar.co.uk{model_url}'
yield (model_url, model_name)
for make_id, make_name in get_make_info():
for model_url, model_name in get_make_models(make_id):
print(make_id, make_name, model_url, model_name)
CodePudding user response:
Using the code as written for your previous question, all you have to do is print out the 'Url' column of the dataframe:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.autocar.co.uk/"
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.text,'html.parser')
full_car_list = []
car_list = [(x.text, x.get("value"), f'https://www.autocar.co.uk/ajax/car-models/{x.get("value")}/0') for x in soup.select_one('#edit-make').select('option')]
for x in car_list:
r = s.get(x[2])
try:
for item in r.json()['options'].items():
full_car_list.append((x[0], item[1], f'https://www.autocar.co.uk{item[0]}'))
except Exception as e:
full_car_list.append((x[0], 'no models', f'https://www.autocar.co.uk/vehicles/{x[0]}'))
cars_df = pd.DataFrame(full_car_list[1:], columns = ['Make', 'Model', 'Url'])
cars_df = cars_df[cars_df.Model != 'All models']
cars_df.to_csv('makes_models.csv')
for x in cars_df.Url.tolist():
print(x)